The Journal of
Educational Psychology

Devoted Primarily to the Scientific Study of Problems of
Learning and Teaching.














BOARD OF EDITORS:

HAROLD ORDWAY RUGG, Chairman, Lincoln School of Teachers College, Teachers College, Columbia University.
JAMES CARLETON BELL, Brooklyn Training School for Teachers.
FRANK NUGENT FREEMAN, University of Chicago.
ARTHUR IRVING GATES, Teachers College, Columbia University.
VIVIAN ALLEN CHARLES HENMON, University of Wisconsin.
RUDOLF PINTNER, Teachers College, Columbia University.
BEARDSLEY RUML, Carnegie Corporation, New York City.
LEWIS MADISON TERMAN, Leland Stanford University.
EDWARD LEE THORNDIKE, Teachers College, Columbia University.
LAURA ZIRBES, Assistant Editor, Lincoln School of Teachers College.


SEPTEMBER, 1923. $4.00 a year. 60c. per copy.








CONTENTS 


A NEW METHOD FOR DETERMINING THE SIGNIFICANCE OF DIFFERENCES
IN INTELLIGENCE AND ACHIEVEMENT SCORES. Truman L. Kelley

THE ACHIEVEMENT QUOTIENT TECHNIQUE. G. M. Ruch

FORMULAS FOR THE CORRELATION BETWEEN RATIOS. Karl J. Holzinger


THE VALIDATION OF —— Pairs. - M. Seotun 


AN EXPERIMENTAL STUDY OF THE RELATIVE DIFFICULTY OF TRUE-FALSE,
MULTIPLE-CHOICE, AND INCOMPLETE-SENTENCE TYPES OF EXAMINATION
QUESTIONS. H. H. Remmers, L. E. Marschat, Adelaide Brown, and
Isabella Chapman

A UNIFORM OBJECTIVE EXAMINATION ON INTELLIGENCE TESTING.
Denton L. Geyer

COMMUNICATIONS AND DISCUSSIONS:
    NOTE UPON HOLZINGER'S FORMULA FOR THE PROBABLE ERROR.
    Truman L. Kelley
    REPLY TO PROFESSOR KELLEY'S CRITICISM. Karl J. Holzinger
    LETTER FROM A. A. ROBACK

NEW PUBLICATIONS IN EDUCATIONAL PSYCHOLOGY AND RELATED
FIELDS OF EDUCATION


Published Monthly Except June to August by 


WARWICK and YORK, Inc., 
York, Pa. Baltimore, Md, 





















Entered as Second Class matter November 15, 1921, at the Post Office at York, Pennsylvania, under the Act of March 3, 1879.










The Journal of 
Educational Psychology 


is published monthly, by Warwick & York,
Inc., Baltimore, Md., and York, Pa. The issues
for 1922 make Volume XIII. Title page and 
index are bound in December number. 

Manuscripts for publication, books or other 
materials for review, and news items should be 
addressed to Harold Ordway Rugg, Chairman of 
the Editorial Board, 59 Edgecliffe Terrace, 
Yonkers, New York. 

The price of the Journal is $4.00 a year, payable 
in advance. Foreign postage is 30 cents extra. 
Single issues are 60 cents each, and less than a full 
year’s subscription is charged at single issue rate 
for every month ordered. 

Subscribers should notify the publishers of 
change in address at least four weeks before 
publication of the issue with which change is to 
take effect. No claim for non-receipt of an 
issue will be entertained unless made within two 
weeks after receipt of the next succeeding issue. 


Warwick & York, Inc.,
Baltimore, Md.


State Maintenance of
Teachers in Training

BY
WALTER SCOTT HERTZOG

Teaching has failed to attract a sufficient supply
of recruits to make adequate selection and
training possible. State control of education
locates the responsibility for teacher training and
for the equalization of educational opportunity
upon the state. The policy of training teachers
after they enter the service as a substitute for
institutional preparation has not been successful,
especially in the open country. The unequal
salaries paid in the rural schools and in the cities
have drawn the better teachers to the centres of
population. The public attitude toward teaching
has reduced the attendance in teacher training
institutions. To improve the attractiveness
of teaching as a profession, the state must use all
the plans that experience has proved to be helpful.
Additional efforts of an economic
and social nature must be made if the rural
schools are to be taught by competent teachers.
The purpose of this present study is to investigate
the principles, problems, and practices involved
in a system of subsidies for prospective teachers
as one method of recruiting the profession.


144 pages. $1.60. Postage 8c. 


Warwick & York, Inc., 
Baltimore, Md. 





TEACHERS’ 
HEADQUARTERS








EDWARD DAVIS 


Manager 


HOTEL RENNERT 


BALTIMORE, MARYLAND 




THE JOURNAL OF 
EDUCATIONAL PSYCHOLOGY 


Volume XIV September, 1923 Number 6 














A NEW METHOD FOR DETERMINING THE SIGNIFICANCE
OF DIFFERENCES IN INTELLIGENCE
AND ACHIEVEMENT SCORES


TRUMAN L. KELLEY 
Stanford University 


The satisfaction that attaches to such a judgment as "Joe is a
bright boy, but his school work is not up to his ability," or "John is
more capable in science than he is in language study" is probably due
to the very great momentousness of the judgment. Father, mother,
and John himself have a keen desire to know the likelihood of future
success in various callings, and the demand for light is so insistent
that it is little wonder that otherwise reputable teachers and school
principals not infrequently proffer advice when they are really
incompetent to do so.

Without referring to the aggravation of the problem caused by 
charlatans and human nature fakers, there has been a real difficulty 
in it on account of the lack of an experimental means of checking 
up one’s judgments of the differences in the mental abilities of an 
individual. 

There has been a resort in recent years to an appraisal of scholastic 
success and promise by means of the Accomplishment Quotient. A 
child’s pedagogical age, determined in a very fallible way (by class 
marks or scores in a school test) is divided by his mental age, likewise 
determined by fallible means (a group or individual intelligence test) 
and this quotient is taken as the ratio of what the child accomplishes 
to what he would have accomplished had he put forward just average 
effort. 

A reliable judgment of this sort would serve many purposes. 
It is therefore of importance that the interpretation of such a quotient 
be made only in the light of its probable error. Toops and Symonds 



have discussed the merits of the accomplishment quotient¹ and
Chapman has shown² the unreliability of measures of difference
between educational and intelligence test scores. Though I find my- 
self in the main in agreement with Chapman’s position I do not consider 
the difficulty of making judgments of difference as great as he pictures 
it, and for two reasons, one theoretical and the other practical. First, 
he judges of the excellence of one fallible difference by comparing it 
with a second fallible difference, whereas he would have obtained a 
truer idea of the significance of his fallible measure had he compared it 
with a true difference; second, the illustrations he gives are not well 
chosen as the functions or tests involved are not as disparate as are 
other functions for which we can readily obtain fairly reliable scores. 

The logical approach to this problem would be to ascertain by 
experimental and statistical analysis disparate mental functions in 
mankind, or at least in childhood, and then devise tests to measure 
these functions separately. Such an approach would utilize the best 
of our fallible mental measures and endeavor to determine the degree 
of community or disparity between the traits involved provided they 
were measured perfectly, and would seize upon any found to be disparate 
when so measured. Total disparity between two traits means, of 
course, that a given amount of the one trait presages nothing as to the
probable amount of the other. The finding of these traits would
constitute the foundation for the important practical problems of 
differentiation of abilities. 

It is obvious that two intrinsically disparate traits, if measured in 
a very unreliable manner, will not permit of reliable judgments of the 
sort ‘“‘John is more capable in the first than in the second.”’ On the 
other hand if two so-called traits are intrinsically the same then 
inaccurate measures of them will proclaim differences which in fact do 
not exist. Thus errors in our measuring devices cloud real differences 
and bring forward spurious ones. The subject matter of this article 
is the validity of differentiation of abilities upon the basis of scores 
obtained in ordinary achievement tests and not the more funda- 
mental and logically antecedent one of discovery of disparate traits. 

We will not deal with quotients of measures, but with differences.
Let there be two tests, say arithmetic and spelling, resulting in scores
$X_1$ and $X_2$ with means for the group studied $M_1$ and $M_2$ and standard
deviations $\sigma_1$ and $\sigma_2$. We cannot immediately interpret a difference
$(X_1 - X_2)$ in a child's arithmetic and spelling test scores because the


1 This journal, Dec., 1922 and Jan., 1923. 
2 This journal, Feb., 1923. 







units of measurement are very different. Thus if a fifth grade pupil,
A, secures scores of 25 and 80 upon the two tests we cannot forthwith
judge of his relative standing in arithmetic and spelling. We must
first express these scores as deviations from the grade means divided
by the standard deviations. Such derived scores we will call standard
scores, because their standard deviations are equal to 1.0, and
represent by the symbols $z_1$ and $z_2$. Thus

$$z_1 = \frac{X_1 - M_1}{\sigma_1}, \qquad z_2 = \frac{X_2 - M_2}{\sigma_2} \tag{1}$$

If $M_1 = 15$; $M_2 = 70$; $\sigma_1 = 5$; $\sigma_2 = 20$, we have for the example given,
$z_1 = (25 - 15)/5 = 2.0$, and $z_2 = (80 - 70)/20 = .5$. Accordingly
A, being 2.0 standard deviations above the grade mean in arithmetic
and .5 standard deviations above in spelling, is 1.5 standard deviations
better in arithmetic than in spelling. Let us call such a difference d.
Thus

$$d = z_1 - z_2 \tag{2}$$

Is this difference significant or is it a chance difference, due to the fact 
that we have inaccurate measures of both arithmetic and spelling 
abilities? 
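The arithmetic of formulas (1) and (2) can be sketched in a few lines of Python. This is only an illustration of the worked example for pupil A; the function name `standard_score` is a hypothetical label introduced here and does not come from the article.

```python
def standard_score(raw, mean, sd):
    """Formula (1): deviation from the group mean, in standard-deviation units."""
    return (raw - mean) / sd


# Values taken from the worked example for pupil A in the text.
z1 = standard_score(25, 15, 5)    # arithmetic: mean 15, standard deviation 5
z2 = standard_score(80, 70, 20)   # spelling: mean 70, standard deviation 20
d = z1 - z2                       # formula (2)

print(z1, z2, d)
```

Whether a d of this size is trustworthy is exactly the question the remainder of the argument addresses; the sketch only reproduces the computation of d itself.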

Let us assume a second individual, B, obtaining a $d = -1.0$, i.e.
B is one standard deviation better in spelling than in arithmetic. If
we have N pupils there will be N values of d and the standard deviation
of these d's is, by the usual formula for the standard deviation of a
difference,

$$\sigma_d = \sqrt{\sigma_{z_1}^2 + \sigma_{z_2}^2 - 2r_{12}\sigma_{z_1}\sigma_{z_2}} = \sqrt{2 - 2r_{12}} \tag{3}$$

in which $r_{12}$ is the correlation between the arithmetic and spelling
scores.
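As a rough numerical check of formula (3), the following Python sketch evaluates the group standard deviation of differences for a given correlation. The function name is hypothetical, and the correlation .224 is borrowed from the value the article later reports for Computation and Arithmetic Reasoning, since no arithmetic-spelling correlation is stated at this point.

```python
import math


def sd_of_group_differences(r12):
    """Formula (3) for standard scores: sigma_d = sqrt(2 - 2 * r12)."""
    return math.sqrt(2.0 - 2.0 * r12)


# Correlation reported later in the article for Computation and
# Arithmetic Reasoning; used here only as an illustrative input.
sigma_d = sd_of_group_differences(0.224)
print(round(sigma_d, 3))
```

The formula behaves as one would expect at the extremes: uncorrelated tests give $\sqrt{2}$, and perfectly correlated tests give zero.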

It is important to note that this is the standard deviation of the 
distribution of d’s, but it is not at all the standard error of a single d. 
This may be made clear by considering the difference between height 
and weight, two measures which we can obtain with very high
reliability. Let us say that for two individuals, E and F, we obtain two
height-weight d's, 1.5 and −1.0 respectively. Were we to measure
E in both height and weight a number of times we might, due to
inaccuracies in weighing and measuring height, obtain a series of d's,
1.502, 1.498, 1.501, etc. Similarly for F, −1.005, −1.004, −1.000,
etc. These two series of d's are not samplings of the same thing.
Individual E is tall for his weight and individual F is short for his and
that is the truth of the situation. The differences found are
substantially accurate so that the standard error of these d's should be a
very small amount, let us say .002, and not at all of the order of 
magnitude, roughly .9, given by formula (3). 

What we wish to know is the deviation of the d's, obtained for
individual A by successive testings of him, from his true d. Let
us define the true difference for individual A as equal to his true $z_1$
score minus his true $z_2$ score, and let us define his true $z_1$ score as his
average score in arithmetic if tested by means of similar tests and under
similar conditions an infinite number of times. We will define his
true $z_2$ score in the same way except that we are here dealing with
spelling and will designate these true scores by $z_\infty$ (z sub infinity) and
$z_\omega$ (z sub omega).

A single d ($= z_1 - z_2$) obtained for individual A is similar to a
second one obtained for him by means of additional $X_1$ and $X_2$
measures because individual A is the same throughout, i.e., $z_\infty$ and $z_\omega$ have
not changed. Thus the standard error which we wish is not $\sigma_d$ of
formula (3), but $\sigma_{d.\infty\omega}$, that is, the standard deviation in the d's for
constant values of $z_\infty$ and $z_\omega$. The magnitude $\sigma_{d.\infty\omega}$ is of the type
$\sigma_{1.23}$ familiar to students of multiple correlation. Those who are not
will need to skip the next paragraph giving the evaluation of this
standard error. In deriving it, certain basic formulas are required.
The original derivations of these have appeared in various places. I
have elsewhere¹ given the derivation of each of them, and as these
derivations are lengthy they will not be repeated here. The needed
formulas are:


(a) $r_{1\infty} = \sqrt{r_{1I}}$. The correlation between a fallible score and
a true score of the same function is equal to the
square root of the reliability coefficient.

(b) $r_{1\omega} = \dfrac{r_{12}}{\sqrt{r_{2II}}}$. The correlation between a fallible score and
a true score of a different function is equal to
the correlation between the fallible scores in
the two functions divided by the square root
of the reliability coefficient of the second
fallible score.

(c) $\sigma_{x_1 - x_2} = \sqrt{\sigma_1^2 + \sigma_2^2 - 2r_{12}\sigma_1\sigma_2}$

(d) $\sigma_{1.23}^2 = \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2)$

(e) $r_{13.2} = \dfrac{\Sigma x_{1.2}x_{3.2}}{N\sigma_{1.2}\sigma_{3.2}}$


1 Kelley: “Statistical Method.” Macmillan, 1923. 










(f) $x_{1.2}$ is a residual, that is, it is that part of $x_1$ which is independent
of $x_2$.

(g) In case $z = \dfrac{X - M}{\sigma}$ then $\sigma_z = 1.0$.

With these transformations available we may derive $\sigma_{d.\infty\omega}$:

$$\sigma_{d.\infty\omega}^2 = \frac{\Sigma(z_{1.\infty\omega} - z_{2.\infty\omega})^2}{N} = \sigma_{1.\infty\omega}^2 + \sigma_{2.\infty\omega}^2 - 2r_{12.\infty\omega}\sigma_{1.\infty\omega}\sigma_{2.\infty\omega} \qquad \text{By (c)} \tag{4}$$

$$\sigma_{1.\infty\omega}^2 = \sigma_{z_1}^2(1 - r_{1\infty}^2)(1 - r_{1\omega.\infty}^2) = 1(1 - r_{1I})(1 - 0) = 1 - r_{1I} \qquad \text{By (g), (a), (e), (d)} \tag{5}$$

The correlation $r_{1\omega.\infty}$ is readily shown to be equal to zero:

$$r_{1\omega.\infty} = \frac{\Sigma z_{1.\infty}z_{\omega.\infty}}{N\sigma_{1.\infty}\sigma_{\omega.\infty}} \tag{6}$$

Now $z_{1.\infty}$ is the residual in $z_1$ for a constant $z_\infty$, i.e. $z_{1.\infty}$ is that part
of $z_1$ which is unrelated to $z_\infty$. Since however $z_\infty$ is the true score
in the function and $z_{1.\infty}$ is that part of the obtained score which is
unrelated to the true score it is simply the chance element in the $z_1$
score. Being thus merely a chance factor it will have a true
correlation with no other measure whatsoever and therefore not with
$z_{\omega.\infty}$, so we obtain $r_{1\omega.\infty} = 0$. This same result may be obtained at
the expense of more labor and less logic by expanding $r_{1\omega.\infty}$ by the
usual multiple correlation formula for three variables and making the
necessary substitutions. The substitutions required are (a), (b), (d),
(e), (f), (g). A similar derivation gives

$$\sigma_{2.\infty\omega}^2 = 1 - r_{2II} \tag{7}$$

Finally, $r_{12.\infty\omega}$ is the correlation between two chance factors (the
residuals in $z_1$ independent of true values of $z_\infty$ and $z_\omega$ and the residuals
in $z_2$ independent of them) and it therefore equals zero. This result
may likewise be obtained by using the usual multiple correlation
formula for four variables, expanding, substituting and simplifying.
We thus obtain the important and very simple formula

$$\sigma_{d.\infty\omega} = \sqrt{2 - r_{1I} - r_{2II}} \tag{8}$$

This formula fills a long felt need since it makes possible the
determination of the probable errors of our judgments of difference of
abilities within the individual.

$$\text{PE (of individual } z_1 - z_2) = .6745\sqrt{2 - r_{1I} - r_{2II}} \tag{9}$$
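Formula (8) can be checked informally by simulation. The Python sketch below retests one imaginary pupil many times and compares the spread of his obtained d's with $\sqrt{2 - r_{1I} - r_{2II}}$. It rests on an assumption that is not stated in the article: each obtained standard score is taken to be the fixed true score plus an independent normal error of variance one minus the reliability. The function names, the true scores 1.2 and 0.3, and the seed are all hypothetical choices made for illustration; the two reliabilities are those of the arithmetic computation and arithmetic reasoning tests in the article's Table I.

```python
import math
import random


def formula_8(rel1, rel2):
    """Standard error of an individual difference, formula (8)."""
    return math.sqrt(2.0 - rel1 - rel2)


def probable_error(rel1, rel2):
    """Probable error of an individual difference, formula (9)."""
    return 0.6745 * formula_8(rel1, rel2)


def simulate_sd_of_d(rel1, rel2, true1, true2, n_trials=200_000, seed=1923):
    """Retest one pupil n_trials times and return the SD of his obtained d's.

    Assumed model (not from the article): obtained score = true score plus
    an independent normal error whose variance is 1 - reliability.
    """
    rng = random.Random(seed)
    err1 = math.sqrt(1.0 - rel1)
    err2 = math.sqrt(1.0 - rel2)
    ds = [(true1 + rng.gauss(0.0, err1)) - (true2 + rng.gauss(0.0, err2))
          for _ in range(n_trials)]
    mean_d = sum(ds) / n_trials
    variance = sum((d - mean_d) ** 2 for d in ds) / n_trials
    return math.sqrt(variance)


simulated = simulate_sd_of_d(0.669, 0.825, true1=1.2, true2=0.3)
print(round(simulated, 3), round(formula_8(0.669, 0.825), 3))
print(round(probable_error(0.669, 0.825), 2))
```

Under the assumed error model the simulated spread agrees closely with formula (8) and does not depend on the pupil's true scores, which is the point of the distinction drawn between formula (3) and formula (8).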













The accompanying table of means, standard deviations, and
reliability coefficients for the separate tests in the Stanford Achievement
battery was obtained from four Palo Alto eighth grade classes.
The population is 96. This table provides us with the constants
needed in obtaining standard scores (z scores) and the probable errors
of the differences in standard scores.








TABLE I

Test                                      Mean     Standard     Reliability
                                                   deviation    coefficient
Arithmetic computation                    36.0       3.94          .669
Arithmetic reasoning                      27.8       5.21          .825
Arithmetic total                          63.8       7.20          .809
Word meaning                              66.2      10.6           .944
Sentence meaning                          64.2      10.8           .806
Paragraph meaning                         45.4       6.25          .848
Weighted reading total                   222.2      30.4           .943
Language usage                            38.1       8.90          .729
Spelling                                 170.9      19.2           .910
Science information                       73.3      10.8           .772
History and literature information        63.8      18.1           .914
Weighted grand total                     825.8      88.1           .954














The test records of four pupils are recorded in the following table. 
Pupils A and B are of opposite type and show extreme inequality in 
achievement record; pupil C is typical of the bright symmetrically 
developed child, and pupil D has the lowest total score, shows marked 
inequality in ability and being nearly seventeen years old is likely to 
be in need of educational and vocational counsel. 








TABLE II

                                          Scores of pupils
Test                                      A       B       C       D
Arithmetic computation                    30      36      40      39
Arithmetic reasoning                      24      32      33      11
Arithmetic total                          54      68      73      50
Word meaning                              78      52      74      36
Sentence meaning                          68      56      67      25
Paragraph meaning                         51      40      51      23
Weighted reading total                   248     188     243     107
Language usage                            46      26      48      20
Spelling                                 144     158     204     170
Science information                       76      73      81      59
History and literature information        76      72      80       9
Weighted grand total                     806     789     948     565

















Expressing the scores of the same pupils as standard scores we obtain
Table III.

TABLE III

                                          Standard scores of pupils
Test                                      A       B       C       D
Arithmetic computation                  -1.5      .0     1.0      .8
Arithmetic reasoning                    - .7      .8     1.0    -3.2
Arithmetic total                        -1.4      .6     1.3    -1.9
Word meaning                             1.1    -1.3      .8    -2.8
Sentence meaning                          .4    - .8      .3    -3.7
Paragraph meaning                         .9    - .9      .9    -3.6
Weighted reading total                    .8    -1.1      .7    -3.8
Language usage                            .9    -1.4     1.1    -2.0
Spelling                                -1.4    - .7     1.7      .0
Science information                       .2      .0      .7    -1.3
History and literature information        .7      .5      .9    -3.0
Weighted grand total                    - .2    - .4     1.4    -3.0



















Pupil D is markedly weak in arithmetic reasoning, reading, and 
history and literature information; slightly better in science infor- 
mation and much better in spelling and arithmetic computation. With 
such other items as are readily available (sex, disciplinary record, 
expressed aspirations, etc.) the kind of advice to give this pupil would 
be obvious, provided we can trust these differences. Let us therefore 
calculate, by formula (9), the important probable errors: 

Difference: Arithmetic computation — Arithmetic reasoning = 
4.0 standard deviations. PE of this difference = .5. 

Difference: Spelling — Reading total = 3.8 standard deviations. 
PE of this difference = .3. 

Difference: Science information — History and literature infor- 
mation = 1.7 standard deviations. PE of this difference = .4. 
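The three comparisons for pupil D can be retraced with a short Python sketch. The means, standard deviations, reliabilities, and raw scores are copied from Tables I and II; the dictionary and function names are hypothetical. The sketch rounds each standard score to one decimal before differencing, which is how the figures of Table III reproduce the differences quoted above.

```python
import math

# (mean, standard deviation, reliability coefficient), copied from Table I.
TABLE_I = {
    "arithmetic computation": (36.0, 3.94, 0.669),
    "arithmetic reasoning": (27.8, 5.21, 0.825),
    "reading total": (222.2, 30.4, 0.943),
    "spelling": (170.9, 19.2, 0.910),
    "science information": (73.3, 10.8, 0.772),
    "history and literature": (63.8, 18.1, 0.914),
}

# Raw scores of pupil D, copied from Table II.
PUPIL_D = {
    "arithmetic computation": 39,
    "arithmetic reasoning": 11,
    "reading total": 107,
    "spelling": 170,
    "science information": 59,
    "history and literature": 9,
}


def difference_and_pe(test_a, test_b, scores):
    """Return (difference in standard scores, probable error by formula 9)."""
    mean_a, sd_a, rel_a = TABLE_I[test_a]
    mean_b, sd_b, rel_b = TABLE_I[test_b]
    # Standard scores rounded to one decimal, as they appear in Table III.
    z_a = round((scores[test_a] - mean_a) / sd_a, 1)
    z_b = round((scores[test_b] - mean_b) / sd_b, 1)
    pe = 0.6745 * math.sqrt(2.0 - rel_a - rel_b)
    return round(z_a - z_b, 1), round(pe, 1)


comparisons = [
    ("arithmetic computation", "arithmetic reasoning"),
    ("spelling", "reading total"),
    ("science information", "history and literature"),
]
results = [difference_and_pe(a, b, PUPIL_D) for a, b in comparisons]
for (a, b), (diff, pe) in zip(comparisons, results):
    print(f"{a} - {b}: difference {diff}, PE {pe}")
```

Each difference is several times its probable error, which is the basis for treating pupil D's inequalities as established.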

These probable errors are so small with reference to the differences 
involved that the existence of very important differences may be taken 
as definitely established. This child, as the result of original nature or 
training, it probably matters not which, approaches the end of school 
and the seventeenth birthday with a definite interest and mental bias. 
Is it not to be expected that the child’s future weal or woe lies in the 
use or disuse of this bias in vocational and recreational life? 

Few cases show as pronounced idiosyncrasy as this pupil D, but
two-thirds of the pupils, as will be shown, reveal differences in excess of 
the chance differences due to the fallible nature of the measures used. 

Pupil C, age 12, is probably a great comfort to parents and 
teachers; always doing well and the cause of no perplexity unless to 
the counselor who is asked to advise a specific vocation or course of 
study. The only difference which is quite certainly significant is that 
between Sentence Meaning and Spelling (difference = 1.4; PE of 
difference = .4). Advice to pay more attention to the meaning of 
sentences and less to the spelling of the words in them is suggested, 
but it is quite unnecessary as the child will certainly do this anyway
in the higher school grades. A little extra spelling ability at this age 
and grade is no serious cause for worry. 

Pupils A and B are two types which are found sufficiently often 
among these 96 pupils to raise the question whether there are not in 
truth "multiple types" such as Thorndike,¹ utilizing such experimental
data as the tests of 10 years ago made possible, found no evidence to 
support. Thorndike has long held that there is unevenness in develop- 
ment, but it now seems quite possible that the unevenness will be 


found not to be random but to fall into types. For pupil A we have 


1 “Educational Psychology,” Vol. III, Chap. XVI. 







Difference: Arithmetic total — Reading total = —2.2 standard 
deviations. PE of this difference = .3. 

Difference: Reading total − Spelling = 2.2 standard deviations.
PE of this difference = .3. 

Here again the differences are indubitably significant. For pupil 
B we have: 

Difference: Arithmetic total − Reading total = 1.7 standard
deviations. PE of this difference = .3.

Difference: Arithmetic total — Spelling = 1.3 standard deviations. 
PE of this difference = .4. 

The search for the meaning in terms of human nature and educa- 
tional psychology of such differences is very enticing, but let us first 
endeavor to ascertain the types of differences which are most apparent 
by the aid of the measuring devices represented by the Stanford 
Achievement tests. The frequency with which differences of various 
sorts are revealed by these fallible tests will not exactly parallel the 
importance of the differences in the natures of the pupils studied, 
because the tests are not equally reliable. Thus, if given four traits, 
a, b, c, d, such that children intrinsically vary as much in the difference 
a — b as in that c — d and given further measures of a and b which 
are more reliable than those of c and d, then we will be able to discover 
and determine differences a — b more often than differences c — d. 
We must therefore keep in mind that the ease with which differences 
are discovered by the aid of the Stanford Achievement Tests depends 
upon: First, the extent to which the individuals differ within themselves 
and second, upon the reliability of the tests. With this in mind let us 
seek to determine the proportion of cases in which the difference
$z_1 - z_2$ (e.g. Computation − Arithmetic reasoning) is so great as to be
significant.

The standard deviation of such a difference for a given pupil is

$$\sigma_{d.\infty\omega} = \sqrt{2 - r_{1I} - r_{2II}}$$

If the distribution of differences for the entire population should have
the same standard deviation as this, then, obviously, the obtained
differences are no greater than chance indicates. However, the
standard deviation, for the group, of obtained differences is

$$\sigma_d = \sqrt{2 - 2r_{12}}$$

and this standard deviation is greater than the former for every
combination by two's of the tests in the Stanford battery. The type
situation is pictured in the accompanying figure. The dotted curve
is a distribution of standard deviation = $\sigma_{d.\infty\omega}$ and the full line




curve is a distribution of the same total area of standard deviation =
$\sigma_d$. Should the full line curve coincide with the dotted line curve
then the differences found would, on the whole, be no greater than
chance suggests, but if the full line curve has the greater standard
deviation then, in the proportion of cases represented by the shaded
area the obtained differences are greater than chance suggests. The
figure drawn represents substantially that for Computation and
Arithmetic reasoning. The shaded area is 26 per cent of the total so
that approximately one-quarter of the pupils showed differences in
computation and arithmetic reasoning abilities, when measured by
the two Stanford Achievement tests, greater than chance. The
proportion represented by the shaded area depends upon the ratio of
the standard deviations, $\sigma_{d.\infty\omega}$ and $\sigma_d$. To obtain this proportion
knowing the ratio, Table IV is given.








TABLE IV

Ratio = $\sigma_{d.\infty\omega}/\sigma_d$. Proportion = proportion of differences in excess of
the chance proportion.

Ratio    Proportion        Ratio    Proportion        Ratio    Proportion
 .02       .950             .35       .467             .70       .171
 .05       .888             .40       .415             .75       .138
 .10       .798             .45       .367             .80       .108
 .15       .719             .50       .323             .85       .078
 .20       .647             .55       .281             .90       .051
 .25       .582             .60       .242             .95       .025
 .30       .522             .65       .205             .99       .005
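The entries of Table IV appear to follow from the geometry of two normal curves of equal area: the shaded region is the part of the wider curve lying outside the narrower one, beyond the points where the two curves cross. The Python sketch below works from that reading, which is an interpretation of the figure rather than a procedure spelled out in the article; the function names are hypothetical. The last lines apply it to Computation and Arithmetic reasoning, using the reliabilities of Table I and the correlation .224 that the article gives for this pair.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the unit normal curve."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def proportion_in_excess_of_chance(ratio):
    """Area by which the wide curve (SD sigma_d) exceeds the narrow one.

    ratio is sigma_{d.inf.omega} / sigma_d and must lie between 0 and 1.
    Distances are measured in units of sigma_d.
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must lie strictly between 0 and 1")
    # Point at which the two equal-area normal curves cross.
    crossing = math.sqrt(-2.0 * math.log(ratio) * ratio ** 2 / (1.0 - ratio ** 2))
    wide_tails = 2.0 * (1.0 - normal_cdf(crossing))
    narrow_tails = 2.0 * (1.0 - normal_cdf(crossing / ratio))
    return wide_tails - narrow_tails


# Two entries of Table IV.
p_half = proportion_in_excess_of_chance(0.50)
p_ninety = proportion_in_excess_of_chance(0.90)

# Computation vs. Arithmetic reasoning: formula (8) over formula (3).
ratio_cr = math.sqrt(2.0 - 0.669 - 0.825) / math.sqrt(2.0 - 2.0 * 0.224)
p_cr = proportion_in_excess_of_chance(ratio_cr)

print(round(p_half, 3), round(p_ninety, 3), round(p_cr, 2))
```

On this reading the function agrees with the tabled values to about three decimals and gives roughly 26 per cent for Computation and Arithmetic reasoning, matching the shaded area quoted in the text.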




















For these four sections of eighth grade pupils the correlations
between the various tests in the Stanford Achievement battery are as
given in Table V. These intercorrelations permit the calculation of
$\sigma_d$, taking the tests in pairs.

Having the intercorrelations of Table V and the reliability
coefficients of Table I we can calculate the ratio $\sigma_{d.\infty\omega}/\sigma_d$ for every pair of
tests and obtain by Table IV the percentage of pupils showing
differences in ability sufficiently great that they cannot be attributed to
chance. Table VI gives these results in the case of the ninety-six
eighth grade pupils.

TABLE V. INTERCORRELATIONS BETWEEN THE TESTS IN THE STANFORD ACHIEVEMENT
BATTERY

Columns: (1) Arithmetic computation; (2) Arithmetic reasoning; (3) Arithmetic
total; (4) Word meaning; (5) Sentence meaning; (6) Paragraph meaning;
(7) Reading total; (8) Language usage; (9) Spelling; (10) Science information.

Test                        (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)   (10)
Arithmetic reasoning        .224
Word meaning                .044   .399
Sentence meaning           -.026   .458   ....   .738
Paragraph meaning           .198   .484   ....   .743   .646
Reading total               ....   ....   .410
Language usage              .015   .266   .201   .520   .500   .619   .613
Spelling                    .329   .167   .300   .472   .284   .516   .477   .366
Science information         .178   .568   .508   .532   .464   .559   .579   .322   .342
History and literature
  information               .120   .507   .432   .674   .620   .630   .714   .439   .385   .764


TABLE VI. PERCENTAGE OF DIFFERENCES IN INDIVIDUAL TEST SCORES IN
EXCESS OF THE CHANCE PERCENTAGE

Columns: (1) Arithmetic computation; (2) Arithmetic reasoning; (3) Arithmetic
total; (4) Word meaning; (5) Sentence meaning; (6) Paragraph meaning;
(7) Reading total; (8) Language usage; (9) Spelling; (10) Science information.

Test                        (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)   (9)  (10)
Arithmetic reasoning         26
Word meaning                 37    38
Sentence meaning             33    33    ..    18
Paragraph meaning            28    27    ..    22    17
Reading total               [illegible]
Language usage               28    28    29    26    18    14    20
Spelling                     27    41    37    44    37    32    44    29
Science information          26    18    20    28    22    20    26    24    33
History and literature
  information                33    31    33    35    24    27    31    27    33    10






































Very interesting evidence of lack of relationship between certain 
scholastic achievements is found in this table. The most striking 
feature of the table is that there are no two functions which do not 
show substantial disparity. The two which can be differentiated 




least readily are Science Information and History and Literature 
Information. Of all the tests these are the least definitely connected 
with the subjects in the elementary school curriculum. 

It is very interesting to note that the knowledge of the meaning of 
words and of the spelling of words do not develop in a parallel manner, 
for no less than 44 per cent of these eighth grade children show measur- 
able differences in relative ability in these two subjects. On the 
other hand Language Usage and Paragraph Meaning are more closely 
allied as but 14 per cent show significant inequality in development in 
these two subjects. These observations apply merely to the differences 
in ability as revealed by the tests and not to intrinsic differences 
within the children. We may, however, infer from the data at hand 
what the percentages of differences in excess of chance would be 
were all the tests equally reliable. 

The mean reliability coefficient, weighting all the tests equally 
and omitting Arithmetic Total, Reading Total, and Grand Total 
reliabilities, is .824. The correlation between Computation and 
Arithmetic Reasoning is .224, but if the two tests were perfectly reliable 
this correlation would be 

$$r_{\infty\omega} = \frac{r_{12}}{\sqrt{r_{1I}}\sqrt{r_{2II}}} = .302$$

in which $r_{\infty\omega}$ is to be read "the estimated correlation between true
scores in the first and second traits." It is the correlation corrected
for attenuation by Spearman's formula. Reversing the calculation, if
the two tests, instead of being perfectly reliable, each have reliabilities
.824, then $r_{12}$ would equal

$$r_{\infty\omega}\sqrt{.824}\sqrt{.824} = (.302)(.824) = .248$$

Under these conditions, i.e. having Computation and Arithmetic
Reasoning tests of reliability .824, we would have

$$\sigma_{d.\infty\omega}/\sigma_d = \sqrt{2 - .824 - .824}\,/\sqrt{2 - 2(.248)} = .484$$


Entering Table IV with this value we obtain .337, the proportion of 
differences in excess of chance between Computation and Arithmetic 
Reasoning provided both tests have reliability .824. Making similar 
calculations for the other pairs of tests yields the values of Table VII. 
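
The arithmetic of the preceding paragraphs can be retraced with a short sketch. It is offered only as an illustration: the function names are our own, the inputs .302 and .824 are the values quoted in the text, and the final look-up in Table IV is not reproduced because that table is not at hand. Carried at full precision the intermediate correlation comes to about .249 where the text prints .248; the ratio agrees at .484 either way.

```python
import math


def disattenuate(r12, rel1, rel2):
    """Spearman's correction: estimated correlation between true scores."""
    return r12 / math.sqrt(rel1 * rel2)


def attenuate(r_true, rel1, rel2):
    """Reverse calculation: correlation expected between two fallible tests."""
    return r_true * math.sqrt(rel1 * rel2)


def chance_ratio(rel1, rel2, r12):
    """Ratio of the S.D. of chance differences to the S.D. of obtained
    differences, both tests being expressed in standard scores."""
    return math.sqrt(2 - rel1 - rel2) / math.sqrt(2 - 2 * r12)


R_TRUE = 0.302  # corrected correlation quoted in the text
REL = 0.824     # mean reliability quoted in the text

r12_equal = attenuate(R_TRUE, REL, REL)    # correlation if both reliabilities were .824
ratio = chance_ratio(REL, REL, r12_equal)  # value with which Table IV is entered

print(f"r12 at equal reliability: {r12_equal:.4f}")
print(f"ratio of sigmas:          {ratio:.3f}")
```

The same three functions serve for any other pair of tests once its corrected correlation is known.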

Whereas we can, with the exercise of some care, obtain a spelling 
test of reliability .91 it is with great difficulty that we can obtain a 
computation or language usage test of reliability .75. Accordingly a 
situation in which all the tests have equal and high reliability is not 
likely to be obtained objectively, but if attained we should expect 
results about as indicated in Table VII. 







TABLE VII. PERCENTAGE OF DIFFERENCES IN INDIVIDUAL TEST SCORES IN EXCESS
OF THE CHANCE PERCENTAGE IN CASE THE RELIABILITY OF EACH TEST IS .824

Columns: (1) Computation, (2) Arithmetic reasoning, (3) Word meaning,
(4) Sentence meaning, (5) Paragraph meaning, (6) Language usage,
(7) Spelling, (8) Science information.

                                        | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8)
Arithmetic reasoning................... | 34  |
Word meaning........................... | 39  | 30  |
Sentence meaning....................... | 40  | 26  | 13  |
Paragraph meaning...................... | 35  | 26  | 14  | 17  |
Language usage......................... | 39  | 33  | 24  | 23  | 17  |
Spelling............................... | 31  | 36  | 28  | 33  | 25  | 30  |
Science information.................... | 35  | 20  | 24  | 25  | 21  | 30  | 31  |
History and literature information..... | 37  | 26  | 20  | 20  | 20  | 27  | 31  | 9





























From this table we learn that Computation is more similar to 
Spelling (31 per cent of differences in excess of chance) than to any
other of the tests, not excepting Arithmetic Reasoning, and that it 
is most different from Sentence Meaning (40 per cent of differences in 
excess of chance). Arithmetic Reasoning is most closely allied to 
Science Information and least to Spelling. Word Meaning, Sentence 
Meaning and Paragraph Meaning are mutually related. Each is also 
quite similar to History and Literature information and quite dissimilar 
to Computation. Language Usage is most similar to Paragraph 
Meaning and least to Arithmetic Reasoning. Science Information 
shows the strongest bond with History and Literature Information, 
and the weakest with Computation. History and Literature Infor- 
mation, in addition to its similarity to Science Information, is related 
with Reading and inversely with Computation. 

The probable errors of these percentages of differences in excess of 
chance are not known but are probably not large as the results point 
uniformly to certain types of relationship and do not wobble around
as would chance results. Repeated and further investigations are of
course necessary, but the indications are that eighth grade children are in
fact “real persons” with specific and unique mental mechanisms. The
charge, sometimes made, that standardized educational processes have 
killed all individuality is not borne out by the findings. When using 
the Stanford Achievement Tests two-thirds of the children reveal one 
or more significant inequalities in levels of attainment in different 
subjects and there are undoubtedly still other important mental 
differences not as yet revealed. 




THE ACHIEVEMENT QUOTIENT TECHNIQUE 
G. M. RUCH 


University of Iowa 


Introduction—The very welcome critical article by Toops and 
Symonds¹ which recently appeared in this journal has gone a considerable
way in the clearing up of the fundamental issues in the AQ
technique, although, as the authors point out, it has raised many more
problems than it has attempted to settle. The present writer is particularly
grateful to these authors since it enables him to plunge more
directly into the discussion of certain other issues.

There are today apparently three batteries of tests, in addition to
the ones adapted by Franzen for his purposes, which cover two or
more of the most important elementary school subjects. These are
the Illinois Examination² by Monroe and Buckingham, the Lippincott-
Chapman Classroom Products Survey Tests³ by Chapman, and the
Stanford Achievement Test⁴ by Kelley, Ruch and Terman. The
first includes scales for the measurement of reading, arithmetic and
general intelligence. The second measures only arithmetic and reading.
The last includes reading, arithmetic, spelling, language and
grammar, geography and elementary school science, and history
and literature (six separate subjects and nine tests in all).

The fullest discussion of the significance of the AQ as an educational
instrument is given by Franzen in his monograph entitled “The
Accomplishment Ratio.”⁵ Pintner and Marshall⁶ have also used an
analogous technique in the attempt to measure motivation in terms of
the differences between their “Educational Index” and the “Mental
Index.”


¹ Toops, H. A. and Symonds, P. M.: What Shall We Expect of the AQ?
Journal of Educational Psychology, Vol. XIII, 1922, pp. 513-528 and Vol. XIV,
1923, pp. 27-38.

² Monroe, W. S. and Buckingham, B. R.: “The Illinois Examination, Teachers
Handbook.” Bureau of Research, University of Illinois, 1920.

³ Chapman, J. C.: “The Lippincott-Chapman Classroom Products Survey
Tests.” J. B. Lippincott Co., Philadelphia, 1920-21.

⁴ Kelley, T. L., Ruch, G. M. and Terman, L. M.: “The Stanford Achievement
Test.” World Book Co., Yonkers, N. Y., 1923; especially the “Manual of
Directions,” pp. 52-58.

⁵ Franzen, R.: “The Accomplishment Ratio.” Teachers College Contributions
to Education, No. 125, 1922.

⁶ Pintner, R. and Marshall, H.: A Combined Mental-educational Survey,
Journal of Educational Psychology, Vol. XII, 1921, pp. 32-43.

Nature of the Present Study.—The present study reports compari- 
sons between the educational ages (EA’s) and achievement quotients 
(AQ’s) obtained by the use of the Illinois Examination, the Lippincott-
Chapman Classroom Products Survey Tests, and the Stanford Achievement
Test. All three of these examinations were given to a group of
about seventy-five VI, VII, and VIII grade pupils in the University
Elementary School at Iowa City, Iowa. Due to absences, the total
number of complete papers was reduced to 64, this number being
constant in all calculations reported here. Because, however, of
the close classification in effect in this school, the range of talent in
the three combined grades is less than that of a single unselected age
group or possibly even the average public school grade.

In order to make all comparisons as strictly comparable as possible, 
several modifications of the customary uses of these tests have been 
made. These can be described as follows: 


1. In the Illinois Examination, the intelligence scores have not been used at 
all. In all of the test batteries, for purposes of calculating AQ’s, Stanford 
Binet Mental Ages have been used. Educational age in the Illinois 
Examination means throughout this study the average of the educational 
ages for the two subjects, arithmetic and reading. (Reading age in turn 
is the average of rate and comprehension age scores.) 

2. In the Lippincott-Chapman Test the same procedure for educational ages 
was followed as in the case of the Illinois Examination, i.e. reading and 
arithmetic ages were averaged. 

3. In using the Stanford Achievement Test, two procedures have been followed: 

(a) The average of arithmetic and reading ages has been determined for
purposes of comparison with the two preceding tests, and,

(b) The educational ages for the composite scores of the entire battery 
(six subjects or nine separate tests) have been calculated according to 
the regular procedure described in the manual. 


Educational ages (1), (2), and (3a) are therefore quite comparable 
since all are based upon reading and arithmetic alone. Educational 
ages (3b) are useful as a check and criterion since they are based upon a 
130-minute testing of high reliability (about 0.98 for unselected age 
groups. See “‘Manual of Directions,” pp. 15-16). The Binet Mental 
ages as well as the chronological ages are also serviceable as statements 
of the range of talent in this experimental group of 64 subjects. The 
notations in connection with the Stanford Achievement Test scores
of (a) and (b) are defined in the above statements and this usage is
consistent throughout this paper. The AQ is defined (except in
Table V) by the formula $AQ = \frac{EA}{MA}$, where AQ = the achievement
or accomplishment quotient, EA = educational age, and MA =
Binet mental age. In the AQ’s of Table V, $AQ = \frac{EA}{CA}$, where CA
= chronological age.
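
The two definitions may be illustrated by a small sketch. The pupil below is hypothetical and is not one of the 64 subjects; the helper names are our own. Since the tabled means lie near 100, the tables appear to report the quotient multiplied by 100, and the sketch prints it on that scale as an assumption.

```python
def age_in_months(years, months):
    """Convert an age written as years-months (e.g. 14-4.1) to months."""
    return years * 12 + months


def quotient(educational_age, base_age):
    """EA divided by a base age: MA gives the AQ proper,
    CA gives the variant used in Table V. Both ages in months."""
    return educational_age / base_age


# Hypothetical pupil, for illustration only.
ea = age_in_months(13, 6)   # educational age 13-6
ma = age_in_months(14, 0)   # Binet mental age 14-0
ca = age_in_months(12, 0)   # chronological age 12-0

aq_ma = quotient(ea, ma)    # mental age base
aq_ca = quotient(ea, ca)    # chronological age base

print(f"AQ on MA base: {100 * aq_ma:.1f}")
print(f"AQ on CA base: {100 * aq_ca:.1f}")
```

A pupil who is mentally advanced for his age, as here, will show a lower quotient on the mental age base than on the chronological age base.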
Table I gives the essential facts about the range of talent employed. 














TABLE I

                                         | Mean (years-months) | Sigma (months)
1. Chronological age.................... | 12-10.8             | 17.0
2. Binet mental age..................... | 14-4.1              | 24.5
Educational age:
3. Illinois Examination................. | 14-5.4              | 28.3
4. Lippincott-Chapman................... | [illegible]         | 19.3
5. Stanford Achievement Test (a)........ | 14-4.0              | 17.1
6. Stanford Achievement Test (b)........ | 14-0.3              | 17.4








Table II states the correlations¹ found for the separate variables
of Table I.





TABLE II

                                    | 1   | 2     | 3     | 4     | 5     | 6
1. Chronological age............... | ... | 0.092 | 0.158 | 0.203 | 0.290 | 0.153
2. Binet mental age................ | ... | ...   | 0.694 | 0.746 | 0.782 | 0.789
3. Illinois Examination............ | ... | ...   | ...   | 0.765 | 0.772 | 0.814
4. Lippincott-Chapman.............. | ... | ...   | ...   | ...   | 0.869 | 0.825
5. Stanford Achievement Test (a)... | ... | ...   | ...   | ...   | ...   | 0.950
6. Stanford Achievement Test (b)... | ... | ...   | ...   | ...   | ...   | ...

N = 64




















1 The probable errors are not stated. The number of cases is constant at 64 
throughout. 







Table III gives the correlations between the AQ’s and Binet IQ’s.

TABLE III

Illinois Examination.................................. -0.308
Lippincott-Chapman.................................... -0.564
Stanford Achievement Test (a)......................... -0.692
Stanford Achievement Test (b)......................... -0.753

Table IV gives the medians, means, and standard deviations for
the AQ’s.

TABLE IV

                                      | Median | Mean  | Sigma
Illinois Examination................. | 99.7   | 100.9 | 12.1
Lippincott-Chapman................... | 93.5   | 94.8  | 9.2
Stanford Achievement Test (a)........ | 99.3   | 100.7 | 9.9
Stanford Achievement Test (b)........ | 98.1   | 98.6  | 9.2











Since there are some authors who have recommended the use of the
chronological age as the denominator in the formula for the calculation
of the AQ, it is interesting to compare the same measures given in
Table IV with those obtained by the CA method. This alternative
method has at least the one advantage of not being committed to any
hypotheses about the correlation of educational ability with general
intelligence. Reference to this point will be made later in the
discussion.

Table V presents the AQ’s figured from the chronological age base.























TABLE V

                                      | Median | Mean  | Sigma
Illinois Examination................. | 110.7  | 112.7 | 20.0
Lippincott-Chapman................... | 104.4  | 105.2 | 15.1
Stanford Achievement Test (a)........ | 112.5  | 111.6 | 13.9
Stanford Achievement Test (b)........ | 110.4  | 110.4 | 14.7
Binet IQ............................. | 110.5  | 112.7 | 18.7










Discussion of the Results.—By reference to Table I, it will be seen 
that the agreement among the three test batteries is far from perfect 
for the means and standard deviations. It is difficult here to set up a 
valid criterion of the probable truth. One way of providing such a 
criterion is that of accepting the average of the three batteries (Illinois,
Chapman, and Stanford (b)). This gives us approximately 14-0 as
the probable mean educational or achievement age for the group.
By this criterion the Stanford Achievement Test appears to be the most
accurately scaled, the Illinois Examination running about as much too
high as the Lippincott-Chapman test runs too low. The mean mental
age might also have some value as a criterion in this connection although
it cannot be stated with any great degree of certainty whether the mean 
educational age will be found to equal, exceed, or fall short of the mean 
mental age of a group of pupils. The safest assumption is that, under 
ordinary school conditions, the educational ages will fall short of the 
mental ages in the majority of cases in all probability. Whether this 
a priori statement holds or not, the three test batteries will be found to 
occupy the same rank positions on a scale of difficulties as they did by 
the first criterion. At any rate, we might wish a closer agreement in 
the means than that reported since the means are not very greatly 
affected by the phenomena of unreliability and regression which exert 
marked influences on the standard deviations. 

When the standard deviations are considered, there are greater 
differences in evidence. Two influences may be at work here; viz., 
(1) the age norms may be distorted at the extreme ranges due to arbi- 
trary scaling outside the limits of direct experimental determination, 
and (2) the standard deviations are spuriously large to varying degrees 
in the three batteries due to unreliability. In connection with the first 
point mentioned, it should be noted that the Illinois Examination 
norms as published in the manual of instructions provide for achieve- 
ment or educational ages covering a range of from 6-0 to 25-6 years, 
i.e., they are arbitrarily extended about 10 years beyond the upper 
limits of direct experimental determination. This is not necessarily 
a criticism since practically all educational and mental tests have been 
thus extended, but the fact should be emphasized that such extensions 
do magnify enormously the dangers of error from distortion of the test 
scaling in the extreme ranges. It is moreover difficult to interpret 
the meaning of an educational age of 25 years. The Lippincott- 
Chapman scale covers a range from 10-0 to 17-0 years and the Stanford 
Achievement Test permits a range of educational ages of 7-6 to 18-6 







years, the ages above 14-6 being arbitrary extensions. The actual 
ranges of educational ages by the several methods will throw some light 
on this situation.











                                          | Range          | Difference
Illinois Examination..................... | 10-6 to 20-10  | 10-4
Lippincott-Chapman....................... | 11-2 to 17-10  | 6-8
Stanford Achievement Test (a)............ | 11-7 to 17-7   | 6-0
Stanford Achievement Test (b)............ | 11-6 to 17-4   | 5-10
Binet mental age......................... | 10-0 to 18-8   | 8-8


The agreement is substantial for the Lippincott-Chapman test 
and the Stanford Achievement Test. The range for the Illinois 
Examination is greater than the logic of the situation would seem to 
suggest. Using mental age range as a criterion, the same conclusion is 
indicated again upon the assumption that educational ages are likely
to be found to be less variable than mental ages. The standard devia- 
tions of Table I are in accord likewise. 

Turning to the more important factor of the effects of unreli- 
ability, the standard deviations of Table I present even more striking 
differences among themselves. The variability of the Illinois Exami- 
nation ages is very much greater than is the case with the other two 
batteries, which agree rather closely. The variability of the Illinois 
Examination is much greater than for Binet mental ages. When we 
consider the working time limits of these several tests (using working 
time as a rough index of reliability), the sigmas obtained are quite in 
harmony with the expectations. These time limits are: 





Illinois Examination, reading and arithmetic......................  21½ min.
Lippincott-Chapman Tests..........................................  78  min.
Stanford Achievement Test, reading and arithmetic.................  80  min.
Stanford Achievement Test, composite of all subjects, about....... 130  min.


That the variabilities are markedly affected by unreliability is shown
by the formula $\sigma_{\text{true}} = \sigma_{\text{obtained}}\sqrt{r_{12}}$, where $r_{12}$ is the reliability
of one form of the given test. The standard deviations of all of the
tests, it follows, are spuriously large, this factor apparently being much 
more serious in the case of the Illinois Examination. The reliabilities 
for the range of talent involved in the present study are not known. 
The manual for the Illinois Examination states the reliability coefficients
of the separate tests as 0.76 for arithmetic, 0.72 for comprehension,
and 0.79 for rate, for ranges of talent equal to grades VI, VII,
and VIII pooled (p. 31). The standard deviations are unfortunately
not stated but are undoubtedly larger than those found for the group 
under study here. Doctor Chapman, in a personal communication, 
estimated that the reliability of his test would be about 0.80 to 0.90 
for our group. The reliability coefficients of the Stanford Achieve- 
ment Test for unselected age groups are about 0.98 for the composite 
educational ages (“Manual,” pp. 15-16). The range of talent
involved in such unselected age groups is from 22 to 24 months of 
Stanford Achievement Test Educational ages. For the combined 
reading and arithmetic ages, the reliability is probably not far from 
0.96 to 0.97 for the same range of talent. It will be recalled that the 
range of talent in our experimental group was about 17 months in 
terms of the educational ages of the Stanford Test. 
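
The shrinkage implied by the formula for the true sigma can be sketched numerically. The obtained sigmas below are those of Table I; the reliabilities are stand-ins only, since, as stated, the reliabilities for the present range of talent are not known. We borrow 0.76 (the manual's figure for the Illinois arithmetic test over pooled grades) and 0.98 (the Stanford composite for unselected age groups) purely to show the direction and rough size of the effect.

```python
import math


def true_sigma(obtained_sigma, reliability):
    """Estimated true standard deviation: obtained sigma times the square
    root of the reliability coefficient."""
    return obtained_sigma * math.sqrt(reliability)


# (obtained sigma in months from Table I, ASSUMED reliability)
cases = {
    "Illinois Examination": (28.3, 0.76),
    "Stanford Achievement Test (b)": (17.4, 0.98),
}

for name, (sigma, reliability) in cases.items():
    estimate = true_sigma(sigma, reliability)
    print(f"{name}: obtained {sigma:.1f}, estimated true {estimate:.1f} months")
```

Under these assumed reliabilities the gap between the two batteries narrows but does not close, which is in keeping with the suggestion that scaling in the extreme ranges also plays a part.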

To summarize the foregoing discussion, although it is impossible 
to reduce these differences in variability to a precise numerical state- 
ment, it seems logical to suppose that the differences actually found 
are chiefly matters of unreliability arising from the fact that a very 
brief test (the Illinois Examination) is being compared with two very 
much longer tests (the Lippincott-Chapman and the Stanford Achievement
Test). Ignoring possible differences in the accuracy of the scal-
ings of these tests, it seems that the main source of differences is to 
be found in the varying degrees of fallibility in the tests compared. 
The question can fairly be raised, in this connection, however, whether a 
twenty minute test can be made sufficiently reliable for the purposes 
of the AQ technique. This point constitutes the defense of the longer 
examination like the Stanford Achievement Test. A further point to 
the same effect is the question whether much meaning can be attached 
to AQ’s ranging greatly beyond the IQ’s yielded by the best intelligence
tests, e.g., the Binet Scale. The Illinois Examination provides for
AQ’s between the limits of 34 and 264. The range found in this study
was from 76 to 130. At the present time, Binet IQ’s probably represent
our best estimate of the range of mental abilities in school children. 

The facts of Table II call for little comment. The correlations of 
achievement ages with Binet mental ages are moderately high but far 
from unity even with due allowance for unreliability. From the 
known behavior of the Binet Tests with respect to unreliability, the 
correlation between Binet mental ages and the educational ages yielded 
by the Stanford Achievement Test could rise to 0.95 or higher instead 







of less than 0.80 as found. The coefficients actually reported for the 
various examinations are again in harmony with our conclusions with 
respect to comparative reliabilities. 

Table III shows that negative correlation between IQ and AQ is 
the rule as has been found by Franzen and others. The differences 
among the three examinations again fall in line with our former 
interpretations on the basis of unreliability. 

Table IV shows again the same tendencies reported in Table I;
viz., that the Lippincott-Chapman AQ’s tend to be systematically 
lower than those of the other two examinations. The standard devia- 
tion of the Illinois Examination shows the same relatively larger varia- 
bility previously mentioned. As a criterion of probable truth, the 
safest assumption here is that the mean and median AQ should approxi- 
mate to 100. Except for the Lippincott-Chapman Test, the depar- 
tures from this estimate of the truth are not great. 

Table V (AQ’s upon a chronological age basis) shows the same 
differences in variability supposedly due to unreliability in varying 
degrees. The central tendency of such AQ’s is for these to approxi- 
mate the mean Binet IQ except in the case of the Lippincott-Chapman 
test. These facts are in harmony with the foregoing tables. 


GENERAL CONSIDERATIONS 


The reader is again referred to the discussion of the AQ technique by 
Toops and Symonds for a statement of the issues and assumptions underlying
this method. In the remaining space at the writer’s disposal some
of these assumptions will be re-examined in the light of our new data. 

1. The AQ procedure involves the assumption that educational 
abilities are correlated to unity with mental age (in the sense of the 
Binet Scale or some other measure of general intelligence), at least 
when pupils are pushed to the limits of improvement (Franzen: 
Loc. cit., pp. 30ff). This, of course, implies due allowances for unre- 
liability of measures either by corrections for attenuation or some other 
analogous procedure. After “pushing” the pupils throughout the year,
Franzen obtained correlations ranging from 0.70 to 0.85 where he
estimates that the reliability of his measures would permit correlations
up to about 0.95, for such abilities as vocabulary, arithmetic, reading, and
completion exercises (pp. 21-23). Without pushing the agreements
are much less perfect. That such correlations will rise to unity is, 
therefore, a hope rather than a demonstrated fact. In spite of this 
statement, there is undoubtedly a great deal of value in the use of the 




AQ. This value will doubtless be found to vary greatly with the 
particular school subjects, being greatest for reading and arithmetic
and becoming less and less for grammar, spelling, geography, manual 
training, domestic science, handwriting, etc., in some order as yet 
undetermined. Convincing experimentation must yet be done in 
this connection before we can dogmatize. In view of the imperfec- 
tions of our measures of intelligence and educational ability, the AQ 
is certain to be markedly fallible unless longtime testing is carried out. 

2. The AQ will have a larger probable error than either the numera- 
tor or the denominator in its formula since it, like all quotients, is 
influenced by the unreliability of both terms of its formula. 
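
A rough sketch may make the point concrete. It rests on two assumptions not made in the text: that the errors in EA and MA are independent, and that the usual first-order rule for the error of a quotient applies. The probable error of 2 months for EA is the figure quoted for the Stanford composite age; the 3 months for MA and the ages themselves are hypothetical.

```python
import math


def quotient_relative_pe(ea, pe_ea, ma, pe_ma):
    """First-order relative probable error of EA/MA, ASSUMING the errors
    in numerator and denominator are independent."""
    return math.sqrt((pe_ea / ea) ** 2 + (pe_ma / ma) ** 2)


ea, pe_ea = 168.0, 2.0   # 14-0 in months; P.E. as quoted for the composite age
ma, pe_ma = 168.0, 3.0   # 14-0 in months; P.E. HYPOTHETICAL

rel_pe = quotient_relative_pe(ea, pe_ea, ma, pe_ma)
aq = ea / ma

print(f"relative P.E. of EA: {pe_ea / ea:.4f}")
print(f"relative P.E. of MA: {pe_ma / ma:.4f}")
print(f"relative P.E. of AQ: {rel_pe:.4f}  (AQ = {aq:.2f})")
```

On these assumptions the relative error of the quotient can never be smaller than that of either term, and it approaches the larger of the two only when the other term is nearly free of error.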

3. The value of the AQ will be diminished by faulty grade location. 
Probably not far from two-thirds of all pupils are at any given time 
in the wrong grade in the sense that their educational abilities lie nearer 
the norm of some grade other than the one in which they are actually 
found. (The authors of the Stanford Achievement Test found this to
be the case in four California towns studied.) It is obvious that we
cannot expect the same educational quotients from two pupils having
mental ages of 12 years if one is placed in grade IV and the other in
grade VII, a situation which is not at all uncommon. If, on the contrary,
the AQ technique is used to help discover such errors of classification,
it will have real value.

4. The adoption of the AQ technique for purposes of motivation 
and measurement of motivation will necessitate the establishment of 
norms of maximal achievement, “pushed” norms in Franzen’s terminology.
Franzen’s contention that the AQ does not rise significantly
above 1.00 is based upon reference to such norms. His statement 
certainly does not hold for norms established on unselected age groups 
with ordinary “balanced” or normal emphasis on the various subjects 
of instruction. The AQ’s for the 64 subjects of this study rise above 
1.00 in about 45 per cent of the cases with the Stanford Achievement 
Test, the mean being very close to unity. The distribution of AQ’s is 
therefore practically normal in this situation. 

5. There are certain arguments to be advanced in favor of AQ’s 
based upon chronological rather than mental age in that this procedure 
is committed to no hypotheses about the correlations of mental traits. 
It is a procedure that is analogous to the concept of the IQ. The
most valid objection to this method is that there is little correlation
between CA and EA in a given school grade, the relationship often
being negative (Table II).










6. The various educational test batteries and standard mental
tests are so lacking in direct comparability that unless a consistent 
adoption of tests is made and followed, there will result endless con- 
fusion. This is well brought out in the data reported in this study 
even with the consistent use of Binet IQ’s as a base. 


CONCLUSIONS 


1. The comparisons of the three educational test batteries, the 
Illinois Examination, the Lippincott-Chapman Classroom Products 
Survey Tests, and the Stanford Achievement Test, show differences in 
the achievement ages yielded which are significantly large. 

2. Certain evidences were found that the Lippincott-Chapman 
Tests are scaled somewhat too low, resulting in lowered educational 
or achievement ages. 

3. The Illinois Examination appears to be considerably less reliable 
than the other two batteries, although the results are probably as 
good as can be expected in a 21½ minute test. The Lippincott-
Chapman test appears to be more dependable, approaching in accuracy 
the newer and more extensive Stanford Achievement Test. 

4. Study of these three test batteries indicates strongly that 
achievement quotients, to be reliable, must be based on 30 or more 
minutes of testing in the case of most school subjects. The Illinois 
Examination is too brief in our opinion to give entirely satisfactory 
results. The Lippincott-Chapman Tests appear to furnish reliable 
measures for the two subjects of reading and arithmetic. The 
Stanford Achievement Test probably is about as brief a test as is 
consistent with scientific accuracy. It has the further advantage of 
covering six separate fields of elementary school instruction, thus yield- 
ing six separate subject ages as well as a single composite achieve- 
ment age. This composite age, at least, has the required reliability 
for all practical purposes of measurement. (Its probable error equals 
2 months of EA.) 

5. Correlations between mental age and achievement age were 
found to be moderately high for reading and arithmetic although not 
approaching closely to unity in any case. Whether such correlations 
ever can be made unity has not been demonstrated although Franzen 
has shown that this is at least a possibility. 

6. The AQ and the IQ are negatively correlated. 

7. Achievement ages are less variable than Binet mental ages when 
due allowance is made for unreliability. 



















FORMULAS FOR THE CORRELATION BETWEEN 
RATIOS 


KARL J. HOLZINGER 
University of Chicago 


The increasing use of ratios in educational measurement makes 
suitable formulas for their treatment very necessary. Quite fre- 
quently the correlation between two ratios is required. This may be 
obtained by dividing each individual numerator by its denominator 
and then correlating the N pairs of quotients thus found. An alter- 
native procedure is to obtain the required correlation between ratios 
in terms of the correlation between the respective numerators and 
denominators. This method eliminates the 2N divisions necessary 
by the first scheme, and has certain theoretical advantages as well. 


Let $\frac{X}{Y}$ and $\frac{Z}{W}$ be the ratios correlated, the capital letters denoting
original scores or measures (small letters will be used later to denote
deviations from means). Further, let $M_x$, $M_y$, $M_z$ and $M_w$ be the
means of the undivided scores, $M_{\frac{x}{y}}$ and $M_{\frac{z}{w}}$ the means of the ratios
$\frac{X}{Y}$ and $\frac{Z}{W}$, $r_{xy}$ the correlation between X and Y, and
$V_x = \frac{\sigma_x}{M_x}$, $V_y = \frac{\sigma_y}{M_y}$, etc.

The arithmetical means and standard deviations of the ratio
are given in Yule¹ as follows:

$$ M_{\frac{x}{y}} = \frac{M_x}{M_y}\left(1 - r_{xy}V_xV_y + V_y^2\right) \qquad (1) $$

$$ \sigma_{\frac{x}{y}} = \frac{M_x}{M_y}\sqrt{V_x^2 - 2r_{xy}V_xV_y + V_y^2} \qquad (2) $$


¹ Yule, G. U.: “Introduction to the Theory of Statistics.” Sixth edition,
Charles Griffin, London, 1922, p. 215.




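Since $N r_{xy}\sigma_x\sigma_y = \Sigma xy$, the required correlation, $r_{\frac{x}{y}\frac{z}{w}}$, may be written

$$ N r_{\frac{x}{y}\frac{z}{w}}\,\sigma_{\frac{x}{y}}\sigma_{\frac{z}{w}} = \Sigma\left(\frac{X}{Y} - M_{\frac{x}{y}}\right)\left(\frac{Z}{W} - M_{\frac{z}{w}}\right) $$

$$ = \Sigma\left(\frac{X}{Y}\cdot\frac{Z}{W}\right) - N M_{\frac{x}{y}} M_{\frac{z}{w}} $$

$$ = \frac{M_xM_z}{M_yM_w}\,\Sigma\left(1 + \frac{x}{M_x}\right)\left(1 + \frac{z}{M_z}\right)\left(1 + \frac{y}{M_y}\right)^{-1}\left(1 + \frac{w}{M_w}\right)^{-1} - N M_{\frac{x}{y}} M_{\frac{z}{w}} $$

If this last expression is expanded neglecting terms higher than the
second it reduces upon substituting (1) and (2) to the desired form,

$$ r_{\frac{x}{y}\frac{z}{w}} = \frac{r_{xz}V_xV_z + r_{yw}V_yV_w - r_{xw}V_xV_w - r_{yz}V_yV_z}{\sqrt{(V_x^2 - 2r_{xy}V_xV_y + V_y^2)(V_z^2 - 2r_{zw}V_zV_w + V_w^2)}} \qquad (3) $$

Six correlations and four means and standard deviations are required
to work out the value of $r_{\frac{x}{y}\frac{z}{w}}$ from this expression so that it hardly
appears shorter than the method of dividing for the N pairs of individual
ratios. Sometimes, however, the above correlations are already
at hand for other purposes and in this case substitution in formula
(3) is a very simple matter.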
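
Substitution in formula (3) may be sketched as follows. The function and argument names are our own, and the numerical values are illustrative only. Two checks are included which follow from the algebra rather than from any data: a ratio correlated with itself should give unity, and two ratios with mutually uncorrelated terms of equal variability that share one denominator should give 0.5, the familiar spurious correlation of indices.

```python
import math


def ratio_correlation(r_xz, r_yw, r_xw, r_yz, r_xy, r_zw, v_x, v_y, v_z, v_w):
    """Approximate correlation between X/Y and Z/W from the six
    correlations and four coefficients of variation (V = sigma / mean)."""
    numerator = (r_xz * v_x * v_z + r_yw * v_y * v_w
                 - r_xw * v_x * v_w - r_yz * v_y * v_z)
    denominator = math.sqrt(
        (v_x ** 2 - 2 * r_xy * v_x * v_y + v_y ** 2)
        * (v_z ** 2 - 2 * r_zw * v_z * v_w + v_w ** 2)
    )
    return numerator / denominator


# Check 1: the same ratio taken twice (Z = X, W = Y).
same = ratio_correlation(
    r_xz=1.0, r_yw=1.0, r_xw=0.4, r_yz=0.4, r_xy=0.4, r_zw=0.4,
    v_x=0.15, v_y=0.10, v_z=0.15, v_w=0.10,
)

# Check 2: X, Z and a shared denominator (Y = W), all mutually
# uncorrelated and equally variable.
spurious = ratio_correlation(
    r_xz=0.0, r_yw=1.0, r_xw=0.0, r_yz=0.0, r_xy=0.0, r_zw=0.0,
    v_x=0.10, v_y=0.10, v_z=0.10, v_w=0.10,
)

print(f"same ratio twice:   {same:.3f}")
print(f"shared denominator: {spurious:.3f}")
```

The second check is a reminder that two ratios may correlate substantially through a common denominator alone.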

A simplification enters in when the denominators of the ratios are
the same, e.g. IQ’s by two different tests having identical chronological
age denominators. Formula (3) then reduces to,¹

$$ r_{\frac{x}{w}\frac{z}{w}} = \frac{r_{xz}V_xV_z + V_w^2 - r_{xw}V_xV_w - r_{zw}V_zV_w}{\sqrt{(V_x^2 - 2r_{xw}V_xV_w + V_w^2)(V_z^2 - 2r_{zw}V_zV_w + V_w^2)}} \qquad (4) $$

for which only three correlations and three means and standard
deviations are required. A still further simplification occurs when a
ratio is correlated with an undivided score. Thus for $r_{x\frac{z}{w}}$ we may
substitute Y = 1 in the right hand member of (3) obtaining,

$$ r_{x\frac{z}{w}} = \frac{r_{xz}V_z - r_{xw}V_w}{\sqrt{V_z^2 - 2r_{zw}V_zV_w + V_w^2}} \qquad (5) $$










¹ This result may be shown equivalent to $r_{(x-w)(z-w)}$ when the means are
equal. It may be further noted that Professor Chapman’s formula (X) appearing
in the February number of this Journal may only be used when the variables
are expressed in terms of standard deviations. My attention has also been
called to the fact that both results are included in an early paper by Pearson.







This last expression is very useful in determining the validity of a
given ratio, $Z/W$, when X is the criterion. For example if Z = mental
age, W = chronological age and X an intelligence criterion, $r_{x,\,z/w}$ will
be high when $r_{xz}$ is high, $r_{xw}$ low, and $r_{zw}$ high. A valid IQ should
then show high correlation between mental and chronological age,
high correlation between mental age and criterion, but low correlation
between chronological age and criterion. These statements are all
made with the understanding that the correlation between criterion
and IQ is a measure of the validity of the latter.

A further use of formula (5) is that of comparing the validity of a
ratio with that of a simple score. If $r_{x,\,z/w}$ is higher than $r_{xz}$ the ratio
$Z/W$ would be preferable to the score Z on the basis of validity.
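The comparison just described can be tried numerically. The sketch below evaluates formula (5) for made-up values; the function name and the figures are hypothetical and serve only to show how a high correlation between numerator and denominator raises the validity of the ratio above that of the simple score.

```python
import math

def ratio_with_score_correlation(r_xz, r_xw, r_zw, v_z, v_w):
    """Approximate correlation of a criterion X with the ratio Z/W, after formula (5)."""
    numerator = r_xz * v_z - r_xw * v_w
    denominator = math.sqrt(v_z ** 2 - 2 * r_zw * v_z * v_w + v_w ** 2)
    return numerator / denominator

# Hypothetical case: criterion correlates .60 with the numerator Z and
# not at all with the denominator W; both coefficients of variation are .10.
unrelated_parts = ratio_with_score_correlation(0.6, 0.0, 0.0, 0.1, 0.1)
related_parts = ratio_with_score_correlation(0.6, 0.0, 0.8, 0.1, 0.1)

print(round(unrelated_parts, 3), round(related_parts, 3))
```

In the first call the ratio is less valid than the score Z alone (below .60); in the second, where Z and W correlate .80, the ratio is the more valid of the two.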


In case a measure of the reliability of a ratio is desired formula (3)
may be written in the form,

\[ r_{x_1/y_1,\,x_2/y_2} = \frac{r_{x_1x_2}V_{x_1}V_{x_2} + r_{y_1y_2}V_{y_1}V_{y_2} - r_{x_1y_2}V_{x_1}V_{y_2} - r_{x_2y_1}V_{x_2}V_{y_1}}{\sqrt{\left(V_{x_1}^2 - 2r_{x_1y_1}V_{x_1}V_{y_1} + V_{y_1}^2\right)\left(V_{x_2}^2 - 2r_{x_2y_2}V_{x_2}V_{y_2} + V_{y_2}^2\right)}} \tag{6} \]

where $x_1/y_1$ and $x_2/y_2$ are the ratios in successive trials with a test, and
$r_{x_1x_2}$ and $r_{y_1y_2}$ the ordinary reliability coefficients of the undivided
variables. By suitable choice of origins all of them may be made
equal giving,

\[ r_{x_1/y_1,\,x_2/y_2} = \frac{r_{x_1x_2} + r_{y_1y_2} - r_{x_1y_2} - r_{x_2y_1}}{2\sqrt{\left(1 - r_{x_1y_1}\right)\left(1 - r_{x_2y_2}\right)}} \tag{6$'$} \]


It is again apparent that high correlation between numerator and
denominator increases the reliability as well as the validity of a ratio
$X/Y$. Furthermore high reliability coefficients, $r_{x_1x_2}$ and $r_{y_1y_2}$, and low
coefficients, $r_{x_1y_2}$ and $r_{x_2y_1}$, have the same effect.
Finally if $r_{x_1y_1} = r_{x_2y_2} = r_{x_1y_2} = r_{x_2y_1} = r_{xy}$, equation (6)′ reduces to,

\[ r_{x_1/y_1,\,x_2/y_2} = \frac{r_{x_1x_2} + r_{y_1y_2} - 2r_{xy}}{2\left(1 - r_{xy}\right)} \tag{6$''$} \]













If $r_{x_1x_2} = r_{y_1y_2} = 1$, this expression of course becomes equal to
unity also. In using formula (6)″ $r_{xy}$ may be taken as the average
of $r_{x_1y_1}$ and $r_{x_2y_2}$ or $r_{x_1y_2}$ and $r_{x_2y_1}$, and a rough approximation to
$r_{x_1/y_1,\,x_2/y_2}$ thus obtained. For very careful work, however, it is best to
use formulas (6) or (6)′.
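The two shortened reliability formulas lend themselves to a quick numerical check. The following is a sketch only: the function names are hypothetical, and the placement of the factor 2 in the denominator of (6)′ is assumed from its reduction to (6)″.

```python
import math

def ratio_reliability(r_x1x2, r_y1y2, r_x1y2, r_x2y1, r_x1y1, r_x2y2):
    """Reliability of a ratio when all coefficients of variation are equal, after (6)'."""
    numerator = r_x1x2 + r_y1y2 - r_x1y2 - r_x2y1
    return numerator / (2 * math.sqrt((1 - r_x1y1) * (1 - r_x2y2)))

def ratio_reliability_simple(r_x1x2, r_y1y2, r_xy):
    """Further reduction when the four numerator-denominator correlations are equal, after (6)''."""
    return (r_x1x2 + r_y1y2 - 2 * r_xy) / (2 * (1 - r_xy))

# Hypothetical figures: both undivided variables have reliability .90
# and numerator and denominator correlate .50 in every pairing.
rough = ratio_reliability_simple(0.9, 0.9, 0.5)
full = ratio_reliability(0.9, 0.9, 0.5, 0.5, 0.5, 0.5)
print(round(rough, 3), round(full, 3))
```

When the reliabilities of both parts are set to unity, the simple form returns unity as well, in agreement with the remark above.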













THE VALIDATION OF INTELLIGENCE TESTS 
A. M. JORDAN 


University of Arkansas 


The literature dealing with intelligence tests during the last few 
years has been unusually copious; in fact, each month it constitutes the 
major portion of three or four journals. This material for the most 
part has concerned itself with the invention and application of intel- 
ligence tests. That the results are valid has been taken for granted. 
Owing to (1) the prestige of the names of the makers which was usually 
sufficient to insure the careful statistical treatment of the data used in 
the construction of the tests and to (2) the pressing need for just such 
instruments, many accepted them uncritically. The cordial accep- 
tance by workers of these group tests of intelligence led to the construc- 
tion of several other tests of a similar nature until now there are as 
many as 16 on the market, all claiming to be measures of intelligence. 
There would be no objection to this condition provided that whenever
a child, or a group of children, is measured on each test closely similar
results were obtained. Unfortunately this is not the case. Wide variations
are found both in measurements of individuals and in the measurements
of groups, much greater variation even than is necessary when generous
allowances have been made for that unknown factor, “variations
in human nature.”

Now as long as there was only one claimant to the papal throne of 
Saint Peter the people were not unwilling to believe that the Pope was 
the one divinely appointed, but when after the “Babylonian Captivity”
there were several claimants to this throne the power of the
Pope suffered because the people began to wonder which one, if any, 
was the real pope. So here, when instruments varying in their results 
claim to measure intelligence it becomes necessary to make a careful 
study of each group test and of each sub-test in order to discover if 
possible which is the best measure of intelligence. A small beginning 
of this work is attempted in this paper. 

The writer is not unmindful of the good work that has been done 
in both the criticism and the validation of intelligence tests. In the 
former case both expert and lay criticism have appeared. Such 
experts as Bridges!® and Stenquist*’** have laid their hands to 
this task. Bridges finding such small correlations with school marks 
in universities and realizing the small chance of a correct prophecy of 
class standing from such a correlation concludes “That their general 



use in universities with the object of helping the above mentioned 
administrative and educational problems is absolutely contra-indicated,”
and again, “The results are not only disappointing as to
their usefulness but may be actually misleading.” Stenquist also
finds much to criticise in intelligence tests as now constituted. He 
points to the smallness of the correlation with mechanical ability and 
to the discrepant results obtained by individual pupils in several tests. 
Even when these two rather extreme accounts are tempered with the 
more hopeful findings of Colvin and MacPhail!’ in their emphasis 
on the usefulness of intelligence tests for foretelling extreme grades in 
college when the correlation is low, still it seems evident that workers 
in mental tests are becoming more and more critical of these instruments
of measurement. Nor is this criticism confined to the professional
students of the subject, for Walter Lippman* after reading
a few books on the subject criticised intelligence tests in various ways
pointing out among other things both the lack of agreement in the 
definition of ‘‘intelligence’”’ and the fact that intelligence testers hid 
behind a mathematical and technical smoke screen and concluded that 
intelligence testing is one of the Babu sciences. 

This dissatisfaction with results has led to the studies of the 
validation of intelligence tests. Among these are the studies of 
Breed and Breslich,* Holley,?* Franzen,?! Gates,?5 Root,** and of the 
writer.*. Breed and Breslich made a very thorough study of the 
use of intelligence tests in the classification of pupils. Holley’s 
study is concerned with mental tests for school use. He uses six group 
tests with large numbers of children and computes correlations for 
each of the grades from III to XII. This writer, however, makes 
no attempt to correlate the sub-tests with his criterion of marks nor 
to evaluate them with any other criterion. His was the practical 
problem of determining the most helpful test for school use. He does 
correlate with each other the various sub-tests called by the same name. 
Franzen had more specifically in mind the general evaluation of the 
various tests. He computed the intercorrelations for several tests 
and obtained the correlations again with the factor of reading constant 
by the eminently desirable device of partial correlation. Gates uses 
a composite of educational tests as a criterion with which to correlate 
several group tests of intelligence and the Stanford-Binet individual 
tests. This is one of the most elaborate studies that we have, using as 
it does both partial and multiple correlations and extending from 
grades I to VIII. The results obtained, however, were invalidated for 




the single grades since “There were about 20 pupils to the grade,”
which number is too small for reliable correlations. His most significant
finding is that “Other things being equal the more verbal the
material the higher the correlation with school attainment.” Root
gave various group tests and the Stanford-Binet to 416 pupils in grades 
I to XII and then correlated each of the group tests with the Stanford- 
Binet. He concludes that “‘The varied character of the Otis Tests 
makes it more valuable in analysis than either the Dearborn or 
Mentimeter alone.” (This means that Otis is best). Finally the 
writer made correlations between four tests of intelligence and of their 
sub-tests (31 in all) with grades. Thus sometimes with school marks, 
sometimes with other tests correlations have been made but few, if 
any, investigations have used several criteria, or included the sub-tests 
in their computations. 

The present study differs from those preceding in several respects 
investigating as it does the following problems: 

I. 1. The correlation of the four group tests (Army Alpha, Terman,
Miller, and Otis) and the sub-tests of each with the Stanford-Binet
Tests.


2. The correlation of the four group tests and the sub-tests with 
the factor of age. 

3. The correlation of the four group tests and the sub-tests 
with school marks. (This last work was reported in a previous 
article but is summarized and discussed here as a part of the total 
study.) 


4. The correlation of the four group tests and the sub-tests with 
a learning test devised by the writer. 


5. The correlations of four group tests and the sub-tests with a 
composite of four tests. 

II. All these correlations except two were computed again with the 
factor of age partialed out. 

III. Rougher estimates were made of the value of each test by 
comparing it with each of the others to see how many pupils were 
maintained in corresponding thirds and to discover by means of 
assuming r to be perfect and then calculating the differences between 
actual scores and transmuted scores which test was most consistent 
in its results. 

IV. A collection was made of most of the published coefficients 


of correlation computed with these tests and a classification of 
them formed. 










V. A bibliography of studies, in which the correlation coefficients 
are computed between the tests as a whole or the sub-tests and some 
criteria, was compiled. 

VI. The discovery of the best sub-test or sub-tests of the 31 
was made. 

VII. The discovery of the best group test of the four was made. 

VIII. Discussion and conclusion. 


DESCRIPTION OF METHOD 


Sixty-four pupils of high school age took each of the four group 
tests of intelligence, the Stanford-Binet, the Learning Test, and had 
their intelligence rated by their teachers. These 64 are included in 
each and every correlation computed. Concerning the first criterion 
it is sufficient to say that the tests were given by individuals trained by 
me during a period of three months, and by myself. The responses 
in all cases were taken down verbatim and all were scored by me. In 
obtaining teachers’ estimates of intelligence unusual precautions were 
taken to obtain accurate ratings. In the first place the four teachers 
whose estimates were used were men and women of maturity, critic 
teachers in our training high school, all of them having taken courses in 
psychology. A list of the names of all pupils was issued to them with
the following instructions:

1. Rate as many pupils as you know well. Zero is lowest; ten is 
highest; five is medium. 

2. You may give fractional scores if you desire. 

3. Rate for general intelligence, by which is meant 

(a) Tendency to take and maintain a definite direction in 
thinking. 
(b) The capacity for making adaptations for the purpose of 
obtaining the desired end. 
(c) The power of self-criticism. 
The average of these four ratings was used as a second criterion of 
intelligence. 

The third criterion is an ideational learning test which may be 
described as follows: Write the letter occurring midway between the 
following letters assuming that each letter has a number according to 
its position in the alphabet. For example, a is 1, b is 2, c is 3, etc. 
The letter midway between 1 and 3 will be b. Then followed 35 pairs of 
numbers such as 4-8, 6-10, 10-14, 13-17, etc. between which the 







correct letter was to be located. The pupils had explained to them 
fully what the procedure was and all but two or three comprehended 
the instructions. All understood perfectly at the beginning of the 
second practice period. During one hour, 12 practice periods of three 
minutes each were obtained. The difference between the average of 
the first three and that of the last three was used as the third criterion 
of intelligence, the thought being that if intelligence tests really 
measure capacity to learn, then the amount learned between certain 
times when all were trying should correlate to some extent with the 
tests. 
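The scoring of the learning test described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the function names are hypothetical, the twelve period scores are invented, and the sample pairs are those quoted in the description.

```python
import string
from statistics import mean

def midway_letter(first, second):
    """Return the letter midway between two alphabet positions (a is 1, b is 2, ...)."""
    total = first + second
    if total % 2:
        raise ValueError("no letter lies exactly midway between these positions")
    return string.ascii_lowercase[total // 2 - 1]

def improvement(period_scores):
    """Average of the last three practice periods minus average of the first three."""
    return mean(period_scores[-3:]) - mean(period_scores[:3])

# Pairs quoted in the description of the test.
answers = [midway_letter(a, b) for a, b in [(1, 3), (4, 8), (6, 10), (10, 14), (13, 17)]]
print(answers)

# Invented scores for one pupil over the 12 three-minute practice periods.
gain = improvement([10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22])
print(gain)
```

The gain computed in the last step corresponds to the quantity taken as the third criterion.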

In addition, correlations were made with age and this factor made 
constant in all correlations by the device of partial correlations. 
I am indebted to Professor Thorndike for this suggestion. Since, 
however, practically all the tests correlated negatively with age the 
effect of this procedure was to reduce slightly (.01 to .06) the correla- 
tions obtained. 

The second part of the program for evaluating the uses of these
four tests is concerned with testing the internal consistency of the tests.
That is, if the tests measure approximately the same mental processes
then there should be an approximate consistency of ranking among the
four tests so that if the pupils were divided into thirds by means of
the scores on one test they should fall approximately into corresponding
thirds when another test was used. Secondly, if we take the
regression equation $x_1 = b\,x_2$ where $b = r\,\sigma_{x_1}/\sigma_{x_2}$ and assume that the
correlation is perfect, then it is possible to obtain the score in $x_1$ which
could be expected from $x_2$. (These suggestions came to me largely from
Breed and Breslich in their article in School Review.) The transmutations
have been effected in all the tests and these transmuted
numbers subtracted from the real scores and the averages of these
differences computed. This gives a measure of likeness or of unlikeness
between the scores of the various tests. Finally a composite was
made by transmuting by the just mentioned method the scores of
three tests into one test and taking an average of these transmuted
scores and the actual score. This average was used as a criterion
against which to measure each of the group tests.
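The two consistency checks may be sketched as below. The sketch assumes that the regression equation is applied to deviations from the means, so that with r taken as 1 a score is carried over by matching means and standard deviations; the function names and the sample scores are hypothetical.

```python
from statistics import mean, pstdev

def transmute(source, target):
    """Express each source score on the target test's scale, assuming r = 1."""
    m_source, m_target = mean(source), mean(target)
    scale = pstdev(target) / pstdev(source)
    return [m_target + scale * (s - m_source) for s in source]

def mean_abs_difference(actual, transmuted):
    """Average absolute difference between real and transmuted scores."""
    return mean([abs(a - t) for a, t in zip(actual, transmuted)])

def thirds(scores):
    """Assign each pupil to the lowest (0), middle (1) or highest (2) third by rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [0] * len(scores)
    for rank, pupil in enumerate(order):
        groups[pupil] = min(2, rank * 3 // len(scores))
    return groups

def same_third_count(scores_a, scores_b):
    """Number of pupils falling in corresponding thirds on two tests."""
    return sum(a == b for a, b in zip(thirds(scores_a), thirds(scores_b)))

# Invented scores for six pupils on two tests.
test_one = [10, 20, 30, 40, 50, 60]
test_two = [1, 2, 3, 4, 5, 6]
difference = mean_abs_difference(test_one, transmute(test_two, test_one))
agreement = same_third_count(test_one, test_two)
print(round(difference, 6), agreement)
```

Two tests that rank the pupils identically give a zero average difference and full agreement in thirds; departures from either figure measure the unlikeness of the tests.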


RESULTS


The following tables set forth the results of the correlations of the 
group tests and the sub-tests with the various criteria. 










TABLE I. CORRELATIONS OF THE FOUR TESTS AND SUB-TESTS WITH THE STANFORD-BINET MENTAL AGE. N = 64

Sub-test   | Otis        | Alpha       | Miller      | Terman
           | MA     PE   | MA     PE   | MA     PE   | MA     PE
1          | .525  .061  | .243  .079  | .350  .073  | .492  .064
2          | .591  .055  | .549  .059  | .497  .063  | .384  .071
3          | .275  .077  | .426  .069  | .482  .065  | .572  .057
4          | .568  .057  | .613  .052  | ...         | .450  .067
5          | .554  .058  | .502  .059  | ...         | .471  .065
6          | .482  .065  | .501  .059  | ...         | .441  .067
7          | .436  .067  | .470  .066  | ...         | .425  .069
8          | .384  .071  | .507  .062  | ...         | .437  .068
9          | .367  .073  | ...         | ...         | .346  .074
10         | .414  .060  | ...         | ...         | .537  .060
Group test | .660  .047  | .687  .044  | .530  .060  | .680  .045



































TABLE II. CORRELATIONS OF FOUR GROUP TESTS AND SUB-TESTS WITH STANFORD-BINET MENTAL AGE, THE FACTOR OF AGE HELD CONSTANT

Sub-test   | Otis MA     | Alpha MA    | Miller MA   | Terman MA
1          | .500        | .214        | .319        | .480
2          | .577        | .540        | .485        | .383
3          | .239        | .401        | .461        | .555
4          | .552        | [illegible] | ...         | .432
5          | .541        | [illegible] | ...         | .469
6          | .460        | [illegible] | ...         | .420
7          | .426        | [illegible] | ...         | .408
8          | .365        | [illegible] | ...         | .428
9          | [illegible] | ...         | ...         | .382
10         | [illegible] | ...         | ...         | .520
Group test | .659        | [illegible] | [illegible] | .650











Alpha-4 (Opposites) stands above all sub-tests in correlation with
mental age (.61). It is higher even than all the sub-tests of the Miller
Test combined and only slightly behind the other group tests. Ranking
close to Alpha-4 are Otis-2 (.59), Terman-3 (.57), and Otis-4
(.568). The two highest, Alpha-4 and Otis-2, are both opposites as is
also Terman-3, while Otis-4 is made up of proverbs. When the group
tests as a whole are considered Alpha, Terman, and Otis are practically
the same while Miller shows a distinctly lower correlation.
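The PE columns in the tables of correlations are not explained in the text. The sketch below assumes they follow the conventional formula for the probable error of a correlation coefficient, PE = .6745(1 - r²)/√N; that assumption, like the function name, is not taken from the article, though it agrees with the tabled value for Otis-1.

```python
import math

def probable_error(r, n):
    """Conventional probable error of a correlation coefficient r from n cases."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

# Otis-1 with mental age: r = .525, N = 64.
pe_otis_1 = round(probable_error(0.525, 64), 3)
print(pe_otis_1)
```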

Table II was deduced from Table I by considering also Table III. 




TABLE III. CORRELATIONS OF FOUR GROUP TESTS AND OF THE SUB-TESTS WITH AGE. N = 64

Sub-test   | Otis          | Alpha               | Miller        | Terman
           | Age     PE    | Age     PE          | Age     PE    | Age     PE
1          | -.182  .081   | -.196  .080         | -.311  .076   | -.141  .082
2          | -.245  .079   | -.114  .083         | -.133  .083   | -.038  .084
3          | -.318  .075   | -.287  .077         | -.325  .075   | -.379  .072
4          | -.347  .073   | -.255  .070         | ...           | -.189  .081
5          | -.153  .082   | -.180  [illegible]  | ...           | -.050  .084
6          | -.311  .075   | -.189  .081         | ...           | -.247  .079
7          | -.102  .083   | -.215  .080         | ...           | -.174  .081
8          | -.190  .081   | -.214  .080         | ...           | -.093  .083
9          | [illegible]   | ...                 | ...           | +.156  .082
10         | -.256  .078   | ...                 | ...           | -.247  .079
Group test | -.408  .070   | -.288  .077         | -.320  .075   | -.244  .079

The formula used was¹

\[ r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} \]

by which means
the correlation existing between say Otis Group Test and mental age
may be obtained irrespective of the factor of the relation of each to the
third factor, age; or, to say it another way, with the factor of age
constant. It would perhaps be more precise to have all the pupils
of the same age measured. If the latter were impossible the partial
correlation device is next best. However, this procedure must be
used with some caution. Yule says: “Hence $r_{12.3}$ should be regarded in
general as of the nature of an average correlation; the cases in which it
measures the correlation between $x_{1.3}$ and $x_{2.3}$ for every value of $x_3$
are probably exceptional.”² The effect of this procedure is slight
showing that in these 64 cases the age factor does not affect the correlations
to any large degree.
Again referring to Table III it is seen that almost all intelligence
tests correlate negatively with age during the high school years.
Younger children not only have higher IQ’s but have attained higher
mental levels than the older pupils. The correlations with age were:

Otis ............ -.41
Miller .......... -.32
Alpha ........... -.28
Terman .......... -.24


¹ Yule, G. U.: “An Introduction to the Theory of Statistics,” p. 251.
2 Op. cit., p. 252. 































TABLE IV. CORRELATIONS OF GROUP TESTS AND OF SUB-TESTS WITH TEACHERS' ESTIMATES OF INTELLIGENCE ([illegible]). N = 64

[Table printed sideways in the original; the entries are illegible in this copy.]




All the group tests and all the sub-tests save one, Terman-9, correlated 
negatively with age. The five that correlated lowest are: 


[name illegible] ............ -.45
[name illegible] ............ -.38
[name illegible] ............ -.35
[name illegible] ............ -.33
[name illegible] ............ -.33
[name illegible] ............ -.33


If we believe with Chapman and Dale that “A test element in which
the performance of the Young Bright exceeds that of the Dull Old is,
except in unusual circumstances, ipso facto, a superior test of intelligence”¹
then the negative correlation with age would be quite significant
and it could then be inferred that the higher the negative
correlation with age the better the test is as a measure of intelligence.

In Table IV are included the correlations with teachers’ estimates 
of intelligence. It is well to remember that the estimates were made 
by four mature teachers and that each pupil’s position was determined 
by the average rating given him by all four teachers. 


TABLE V. CORRELATIONS OF GROUP TEST AND SUB-TESTS WITH TEACHERS' ESTIMATES, THE FACTOR OF AGE CONSTANT. N = 64

Sub-test   | Otis        | Alpha       | Miller      | Terman
           | Estimates of intelligence (all four columns)
1          | .526        | .188        | .387        | .536
2          | .459        | .622        | .611        | .448
3          | .232        | .421        | .538        | .584
4          | .386        | [illegible] | ...         | .442
5          | .633        | [illegible] | ...         | .489
6          | .554        | [illegible] | ...         | .390
7          | .567        | [illegible] | ...         | .483
8          | .559        | [illegible] | ...         | .354
9          | [illegible] | ...         | ...         | .392
10         | .355        | ...         | ...         | .518
Group test | .606        | [illegible] | [illegible] | .636




















¹ Chapman, J. C. and Dale, A. B.: A Further Criterion for the Selection of
Mental Test Elements. Jr. Ed. Psy., Vol. XIII, pp. 273-274.










In considering correlations with teachers’ estimates of intelligence

[test names illegible] ............ .73, .61, .66, .68

which become slightly lower when corrected for age. The five
highest sub-tests are

[list illegible]


The average correlation for the four group tests with mental age is 
.64; with teachers’ estimates of intelligence .67. The effect of making 
age constant is to decrease the coefficients from .01 to .06. 

In considering the correlations with the learning test (Table VI)
one must remember that this test is comparatively simple and that the
amount of improvement made was the criterion used.


TABLE VI. CORRELATIONS OF GROUP TESTS AND SUB-TESTS WITH A LEARNING TEST. N = 64

Sub-test   | Otis          | Alpha         | Miller        | Terman
           | Learning test  PE (each column)
1          | .243  .079    | .120  .083    | .234  .079    | .189  .080
2          | .115  .083    | .307  .075    | .106  .083    | .231  .079
3          | .182  .081    | .192  .081    | .159  .082    | .239  .079
4          | .176  .082    | .265  .078    | ...           | .219  .080
5          | .209  .080    | .181  .081    | ...           | .313  .075
6          | .274  .078    | .205  .080    | ...           | .126  .083
7          | .293  .077    | .141  .083    | ...           | .135  .083
8          | .179  .081    | .152  .082    | ...           | -.125  .083
9          | .172  .082    | ...           | ...           | .097  .084
10         | .002  .084    | ...           | ...           | .103  .084
Group test | .228  .080    | .206  .080    | [illegible]   | .207  .080




























TABLE VII. CORRELATIONS OF GROUP TESTS AND SUB-TESTS WITH A LEARNING TEST, THE FACTOR OF AGE CONSTANT. N = 64

Sub-test   | Otis        | Alpha       | Miller      | Terman
           | Learning test (all four columns)
1          | .200        | .064        | .208        | .155
2          | .043        | .276        | .091        | .230
3          | .093        | .114        | .129        | .140
4          | .078        | [illegible] | ...         | .172
5          | .162        | [illegible] | ...         | .313
6          | .191        | [illegible] | ...         | .054
7          | .276        | [illegible] | ...         | .087
8          | .129        | [illegible] | ...         | .102
9          | [illegible] | ...         | ...         | .154
10         | [illegible] | ...         | ...         | .029
Group test | [illegible] | [illegible] | [illegible] | .143
































The correlations with this factor are low:

Otis ............ .23
Terman .......... .21
Alpha ........... .21
Miller .......... .17

The five highest of the sub-tests are:

Terman-5 ........ .31
Alpha-2 ......... .31
Otis-7 .......... .29
Otis-6 .......... .27
Alpha-4 ......... .26


If intelligence is defined as the capacity to learn then the type of 
material to be learned will have to be defined, for in the type of learning 
in which an individual imaginally places a letter between two other 
letters designated by numbers the correlations are too low to be very 
significant. 

And finally the four group tests were combined into one composite 
and this was used as a criterion. This criterion is fairly important 
because with high school pupils there is accumulative evidence that 
two or three group tests combined give unusually good measurements 
of intelligence. 

The composite was made by computing the regression equations 
for each pair of group tests, then by assuming that the correlation 
occurring in the equation was 1.00; secondly, by means of the equations 










thus found substituting each pupil’s score in one part of the equation 
and thus obtaining the most probable score in the other. All scores 
were converted into Otis scores and the three converted scores together 
with the Otis score were averaged, this being used as a composite. For 
further description see the article by Breed and Breslich referred to in 
the bibliography and also the later pages of this paper. 
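The construction of the composite may be pictured with the short sketch below. It assumes, as in the method already described, that with the correlation taken as 1.00 a score is converted by matching means and standard deviations; the function names and the scores for three pupils are hypothetical.

```python
from statistics import mean, pstdev

def to_reference_scale(scores, reference):
    """Convert scores to the reference test's scale, taking the correlation as 1.00."""
    m_scores, m_reference = mean(scores), mean(reference)
    ratio = pstdev(reference) / pstdev(scores)
    return [m_reference + ratio * (s - m_scores) for s in scores]

def composite(otis, other_tests):
    """Average each pupil's Otis score with the other tests converted into Otis scores."""
    converted = [to_reference_scale(test, otis) for test in other_tests]
    return [mean([own] + [test[i] for test in converted])
            for i, own in enumerate(otis)]

# Invented scores for three pupils on the four group tests.
otis = [10, 20, 30]
alpha = [2, 4, 6]
miller = [3, 2, 1]
terman = [5, 10, 15]
composite_scores = composite(otis, [alpha, miller, terman])
print([round(c, 3) for c in composite_scores])
```

In this invented case one test ranks the pupils in reverse order, and the composite is pulled toward the middle accordingly.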


TABLE VIII. CORRELATIONS OF GROUP TESTS AND SUB-TESTS WITH A COMPOSITE MADE UP OF ALPHA, MILLER, OTIS AND TERMAN. N = 64

Sub-test   | Otis          | Alpha         | Miller        | Terman
           | Composite  PE (each column)
1          | .718  .041    | .395  .071    | .691  .044    | .675  .046
2          | .748  .037    | .560  .058    | .787  .032    | .616  .052
3          | .476  .065    | .568  .057    | .782  .032    | .834  .026
4          | .784  .026    | .779  .033    | ...           | .644  .049
5          | .573  .057    | .618  .052    | ...           | .663  .057
6          | .593  .055    | .414  .070    | ...           | .556  .057
7          | .663  .047    | .663  .048    | ...           | .594  .055
8          | .626  .051    | .781  .033    | ...           | .539  .058
9          | .566  .057    | ...           | ...           | .591  .055
10         | .488  .064    | ...           | ...           | .570  .057
Group test | .926  .012    | .909  .015    | .901  .016    | .906  .015


















































TABLE IX. CORRELATIONS OF GROUP TESTS AND SUB-TESTS WITH A COMPOSITE MADE UP OF ALPHA, MILLER, OTIS AND TERMAN, THE FACTOR OF AGE BEING KEPT CONSTANT. N = 64

Sub-test   | Otis        | Alpha       | Miller      | Terman
           | Composite (all four columns)
1          | .663        | .354        | .652        | .675
2          | .730        | .559        | .799        | .646
3          | .409        | .520        | .753        | .809
4          | .755        | .763        | ...         | .627
5          | .561        | .616        | ...         | .584
6          | .538        | .378        | ...         | .517
7          | .674        | .643        | ...         | .579
8          | .609        | [illegible] | ...         | .544
9          | [illegible] | ...         | ...         | .594
10         | [illegible] | ...         | ...         | .532
Group test | [illegible] | [illegible] | [illegible] | .901






































Considering the tests as a whole there seems to be no real difference 
among the four tests in their correlation with this composite. We 
expect a somewhat high correlation with a composite since each group 
test itself is a part of the composite but no such similarity of results 




could appear unless there were a marked similarity among the mental 
functions tested. There are considerable variations among the sub- 
tests. The five highest here are: 


Terman-3 (opposites) .................. .83
Miller-2 (cause and effect) ........... .79
[name illegible] ...................... .78
[name illegible] ...................... [illegible]
[name illegible] ...................... .78


As a whole the correlations of the sub-tests are higher than they are 
with any other criterion. The two lowest of these correlations are 


Alpha-1 (oral directions) ............. .39
Alpha-6 (number completion) ........... .41


The general effect of making age constant is to make the correlations 
somewhat smaller (rarely over .05) in most cases but in some few cases 
to increase very slightly the size of the coefficient. 

Let us now consider the various types of material which make up the 
sub-tests. Seventeen different varieties of material compose the 31 
sub-tests. Analogies and mixed sentences occur in all four tests, 
arithmetic and opposites in three; information, number series, best 
answers, and hard directions in two; and sentence meaning, classifica- 
tion, cause and effect, geometric figures, proverbs, narrative completion, 
similarities, logical selection, and memory occur in one each. Table X
shows the five tests correlating highest with various criteria. 


TABLE X.—THE HIGHEST CORRELATIONS AMONG THE SUB-TESTS WITH SIX CRITERIA.
(FIGURES TO THE LEFT INDICATE THE NUMBER OF SUB-TESTS THAT
WENT TO MAKE UP THE SCORE INDICATED)


Learning test                          | Grades
3 Arithmetic reasoning ... .28         | 3 Arithmetic reasoning ... .49
1 Geometric figures ...... .27         | 1 Sentence meaning ....... .46
1 Logical selection ...... [illegible] | [illegible] .............. .44
[illegible] .............. .21         | 4 Mixed sentences ........ .43
[illegible] .............. [illegible] | 1 [illegible] ............ .40
[illegible] .............. .24         | [illegible] .............. .44

Mental age                             | Teachers' estimates
[illegible] .............. .59         | 1 Cause and effect ....... .63
[illegible] .............. .57         | 1 Geometric figures ...... .60
3 Arithmetic reasoning ... [illegible] | [illegible] .............. .58
2 Number series .......... [illegible] | [illegible] .............. .58
[illegible] .............. .50         | 3 Arithmetic reasoning ... .55
[illegible] .............. .54         | [illegible] .............. .59










Validation of Intelligence Tests 361 


TABLE X.—(Continued)


Lowest with age                        | Composite of Alpha, Otis, Miller, Terman
1 Narrative completion ... [illegible] | [illegible] ....................... .79
1 Geometric figures ...... −.37        | 1 Cause and effect ................ .79
[illegible] .............. [illegible] | [illegible] ....................... .78
[illegible] .............. −.29        | 2 Information ..................... .73
[illegible] .............. −.26        | 1 Following directions (written) .. .72
[illegible] .............. −.34        | [illegible] ....................... .76


The opposite test stands out clearly ahead of the rest since it leads 
in its correlation with mental age and composite, comes fourth with 
teachers estimates and with age, and appears fifth in the learning test. 
It seems clear therefore that it would be unwise to omit it in any group 
of tests which were intended as a measure of intelligence. Arithmetic 
reasoning comes next followed by geometric figures and proverbs. 
It is rather interesting to know that every one of the sub-tests is 
represented in these thirty correlations. 

It also seems worth while to tabulate the four group tests together 
somewhat in summary of this section of the paper. The correlations 
with grade were taken from an article by the writer (31).


TABLE XI.—SUMMARY OF CORRELATIONS OF GROUP TESTS WITH VARIOUS
CRITERIA. N = 64


        | Learning test | Marks     | Mental age | Teachers' estimates | Lowest with age | Composite | Average | Rank
Otis    | .228(1)       | .450(4)   | .660(3)    | .730(1)             | −.408(1)        | .926(1)   | .567    | 1
Alpha   | .206(3)       | .476(2.5) | .687(1)    | .613(4)             | −.283(3)        | .909(2)   | .529    | 3
Miller  | .174(4)       | .476(2.5) | .530(4)    | .678(2)             | −.320(2)        | .901(4)   | .513    | 4
Terman  | .207(2)       | .492(1)   | .680(2)    | .663(3)             | −.244(4)        | .906(3)   | .532    | 2














If we rank each of these tests within each of the criteria, then average
the ranks, some indication may be had as to which is the best instrument
for all round purposes. As far as our data go for testing intelligence
in high schools, Otis is the best test of the four since it scored four
first places and had an average of .567. Terman group test ranks next 
although there is little to choose between it and Army Alpha while 
Miller is only a little behind Alpha. The Miller Group Test receives 
no first place in any of the criteria chosen. This much for the data 
considered in mass; analytically they tell a somewhat different story. 
For example, the differences between the correlations of the four tests 
with the composite are negligible as are also those with the learning 




test and with the grades. On the contrary with mental age, teachers 
estimates, and chronological age the differences are quite significant. 
It is noteworthy that the positive correlations with the learning test
are grouped around two-tenths; with marks around four and one-half tenths;
with mental age around five and six tenths; with teachers' estimates around
six and seven tenths; and with the composite around nine-tenths.
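
The ranking procedure described above can be sketched in a few lines. The figures are those of Table XI with the sign of the age correlations dropped; the row labels for Alpha and Miller are read from the surrounding discussion rather than from the table itself, and the helper names are illustrative only. Tied values are assumed to share the mean of the ranks they occupy.

```python
# Correlations from Table XI, in the order: learning test, marks, mental age,
# teachers' estimates, lowest with age (sign dropped), composite.
# Row labels for Alpha and Miller are inferred from the text (assumption).
TABLE_XI = {
    "Otis":   [0.228, 0.450, 0.660, 0.730, 0.408, 0.926],
    "Alpha":  [0.206, 0.476, 0.687, 0.613, 0.283, 0.909],
    "Miller": [0.174, 0.476, 0.530, 0.678, 0.320, 0.901],
    "Terman": [0.207, 0.492, 0.680, 0.663, 0.244, 0.906],
}


def average_correlation(values):
    """Plain arithmetic mean of the six coefficients."""
    return sum(values) / len(values)


def rank_on_criterion(table, test, j):
    """Rank of `test` on criterion j (1 = highest); ties share the mean rank."""
    value = table[test][j]
    higher = sum(1 for row in table.values() if row[j] > value)
    equal = sum(1 for row in table.values() if row[j] == value)
    return 1 + higher + (equal - 1) / 2


def mean_rank(table, test):
    """Average of the test's ranks over all criteria."""
    n_criteria = len(table[test])
    return sum(rank_on_criterion(table, test, j) for j in range(n_criteria)) / n_criteria


AVERAGES = {test: average_correlation(row) for test, row in TABLE_XI.items()}
ORDER_BY_MEAN_RANK = sorted(TABLE_XI, key=lambda test: mean_rank(TABLE_XI, test))
```

Under these assumptions the averages and the final order agree with the last two columns of the table.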

Thus far nothing has been said about the value of the criteria 
chosen. Nobody knows exactly which should have the most weight 
although, personally, I should put first in this case, teachers’ estimates; 
second, Stanford-Binet mental age; and third, grades. 

The second division of the study concerns the consistency of the 
data. Tests that measure approximately the same mental processes 
should, it seems, within limits, place individuals in corresponding ranks 
in the four tests. Consequently if the scores in the tests were ranked 
from lowest to highest, the lowest third in one test would correspond 
largely with the lowest third in another. The following table throws 
light on this question. 


TABLE XII.—DISPLACEMENTS FROM CORRESPONDING THIRDS IN RELATED TESTS.
N = 64


                     | Otis and Alpha | Otis and Terman | Terman and Alpha | Otis and Miller | Alpha and Miller | Terman and Miller
r                    | .84            | .78             | .71              | .81             | .77              | .79
I                    | 7              | 9               | 12               | 5               | 8                | 8
II                   | 10             | 13              | 10               | 8               | 10               | 8
III                  | 5              | 6               | 8                | 5               | 6                | 4
Per cent displaced   | 34             | 42              | 47               | 28              | 37               | 31¹
Double displacement  | 1              | 2               | 5                | 2               | 2                | 1














1 In a recent article (31) the writer published computations from a table of
Thorndike's which indicated that with a correlation of .70 the percentage of each
third in one test falling in the corresponding third of the second would be
theoretically fifty-five out of 100, and for .90, seventy-one out of 100. Note that
with a correlation of .71 we have fifty-three out of 100 in their corresponding
thirds; with a correlation of .84, sixty-four out of 100; with a correlation of .79,
sixty-nine out of 100. Let us compare with these data some from Breed and
Breslich (8). These authors find that with an r of .69 there is a correct placing of
68.5 per cent in corresponding thirds; with a correlation of .77, 70 per cent were
placed in corresponding thirds. Thus we have some empirical evidence to compare
with the theoretical expectancy.
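
As a rough check on Table XII, the share of pupils displaced can be recomputed from the counts in rows I, II and III. The sketch below assumes that those three rows are the numbers displaced from each third and that the percentage row is their sum divided by N = 64; the names are illustrative. Most recomputed values fall within a point of the printed row, although Otis and Terman comes out nearer 44 than the printed 42.

```python
N_PUPILS = 64

# Counts displaced from thirds I, II and III for each pair of tests (Table XII).
DISPLACED = {
    "Otis-Alpha":    (7, 10, 5),
    "Otis-Terman":   (9, 13, 6),
    "Terman-Alpha":  (12, 10, 8),
    "Otis-Miller":   (5, 8, 5),
    "Alpha-Miller":  (8, 10, 6),
    "Terman-Miller": (8, 8, 4),
}


def percent_displaced(counts, n=N_PUPILS):
    """Percentage of the n pupils not falling in corresponding thirds."""
    return 100 * sum(counts) / n


PERCENTAGES = {pair: percent_displaced(c) for pair, c in DISPLACED.items()}
```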







From this table, Otis and Miller seem most alike since only 18 
persons out of 64, or 28 per cent, were placed incorrectly. But even at 
the best the condition seems bad enough. Here are pairs of tests with 
a correlation above .70, called “high” by Rugg and others, displacing 
almost half of the pupils not from corresponding tenths but from 
corresponding thirds. True it is that some of the displacements were 
near division points but not all of them were and in some cases the 
displacements were from the lowest to the highest third or vice versa. 
And yet many a prognosis has been made on a correlation of .50 or 
even less. 

In order to carry out even further and more precisely the investiga- 
tion of this question of differences existing between tests, a procedure 
using the regression equation and prophesying from one test what the 
score would be in the other was undertaken. Then by subtracting 
the transmuted score from the real score there was obtained a measure 
of likeness or difference between the two tests considered. In this 
case the assumption is made that the correlation is perfect so that we 
may get the score from one test which should exactly correspond with 
that of the other. “Perfect measurement of a constant quality or
trait would show no difference between two scores of this kind. Such
a difference is a symptom of inaccuracy of measurement or of variability
of the thing measured, or of both. It represents for the teacher
the amount by which he may expect two of these tests to differ in
the measurement of the same pupil” (8, p. 61). Instead of using the
formula $y = \frac{\sigma_y}{\sigma_x}\,x$ used by these authors, the writer transmuted this
formula by placing for $y$ its equal $Y - M_y$, and for $x$ its equal $X - M_x$;
thus changed, the formula becomes $Y - M_y = \frac{\sigma_y}{\sigma_x}(X - M_x)$,


by means of which each pupil’s score may be transmuted from one 
test to another. The formulas for the transmutations were derived 
from the following data: 








                 Average     SD      Representation
Otis..........   150.0       5.68    σ₁
Alpha.........   119.0       4.98    σ₂
Miller........    70.2       3.50    σ₄
Terman........   128.5       5.94    σ₃
















and are:
For Otis and Alpha     $X_1 = 1.14X_2 + 14.34$    (1)
For Alpha and Otis     $X_2 = .88X_1 - 13$        (2)
Otis and Terman        $X_1 = .96X_3 + 21.6$      (3)
Terman and Otis        $X_3 = 1.04X_1 - 27.5$     (4)
Army and Terman        $X_2 = .83X_3 + 12.35$     (5)
Terman and Army        $X_3 = 1.19X_2 - 13.1$     (6)
Otis and Miller        $X_1 = 1.62X_4 + 37.3$     (7)
Miller and Otis        $X_4 = .6X_1 - 19.8$       (8)
Army and Miller        $X_2 = 1.42X_4 + 19.3$     (9)
Miller and Army        $X_4 = .7X_2 - 13.1$       (10)
Terman and Miller      $X_3 = 1.7X_4 + 9.2$       (11)
Miller and Terman      $X_4 = .59X_3 - 5.6$       (12)
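
The way such equations follow from the tabulated averages and standard deviations may be sketched as below. The sketch assumes that the slope is the ratio of the two standard deviations rounded to two places before the constant is computed, which reproduces most of the printed equations; the constants printed for equations (3) and (7) come out somewhat differently under this rule (about 26.6 and 36.3), which may reflect a misprint or a step not described. The function and variable names are illustrative.

```python
# Averages and standard deviations as tabulated in the text (units as printed).
STATS = {
    "Otis":   (150.0, 5.68),
    "Alpha":  (119.0, 4.98),
    "Terman": (128.5, 5.94),
    "Miller": (70.2, 3.50),
}


def transmutation(target, source, stats=STATS):
    """Return (slope, constant) of the line mapping `source` scores onto the
    `target` scale, assuming perfect correlation between the two tests."""
    m_y, sd_y = stats[target]
    m_x, sd_x = stats[source]
    slope = round(sd_y / sd_x, 2)          # rounded first (assumption)
    constant = round(m_y - slope * m_x, 2)
    return slope, constant


OTIS_FROM_ALPHA = transmutation("Otis", "Alpha")      # compare equation (1)
ALPHA_FROM_OTIS = transmutation("Alpha", "Otis")      # compare equation (2)
TERMAN_FROM_ALPHA = transmutation("Terman", "Alpha")  # compare equation (6)
```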


With these equations before us¹ it is a simple matter to transmute
scores. For example, the score of our first pupil in the Otis Group
Test is 203 and his score in the Army Alpha is 172. Suppose we
wanted to know what score would be expected on Alpha if Otis and
Alpha measured the same mental processes and were perfect measuring
instruments. By substituting 203 for $X_1$ in equation (2) we get
$X_2 = .88(203) - 13$, which solved becomes 165.64 or 166, which is six points
less than the actual score 172. Again, if we had the Army Alpha
score and wished to know what Otis would be, we use equation (1), which
becomes $X_1 = 1.14(172) + 14.34$. Solving we obtain $X_1 = 210$, or
seven more than the actual score. Differences, then, were computed 
for each individual for all the tests taken two together. These 
differences were averaged and the average used as an indication of 
the discrepancy obtained when a pupil was measured on two tests. 
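
The worked example for the first pupil, and the averaging step that follows it, can be sketched as below. Only the one pupil whose scores are quoted in the text is used, since the remaining individual scores are not reproduced here. Taking the differences without regard to sign before averaging is an assumption, and the function names are illustrative.

```python
def transmute(score, slope, constant):
    """Apply a transmutation equation of the form X_target = slope * X_source + constant."""
    return slope * score + constant


def mean_absolute_difference(pairs, slope, constant):
    """Average size of (actual - transmuted) over (source_score, actual_target_score)
    pairs, ignoring sign (assumption)."""
    diffs = [abs(actual - transmute(source, slope, constant)) for source, actual in pairs]
    return sum(diffs) / len(diffs)


# First pupil: Otis 203, Army Alpha 172.
alpha_expected = transmute(203, 0.88, -13)      # equation (2)
otis_expected = transmute(172, 1.14, 14.34)     # equation (1)
first_pupil_gap = mean_absolute_difference([(203, 172)], 0.88, -13)
```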

The large differences between points scored on tests called by the 
same name, i.e., “intelligence tests,” are even more clearly shown in
this table than in the previous discussion of thirds. In individual 
cases the discrepancy between scores is enormous. If we leave out the 
71 points which undoubtedly was due to some failure by the individual 
to cooperate still 45 or 42 or 49 is a very large difference and on the 
Army Alpha would correspond to almost three years of mental growth. 

If we attempt to compare the tests by using the average of the 
differences of the score in the transmutations of three tests into the 





1 The writer believes that these equations are of general validity and may be
used to transmute scores from one test to another with the smallest possible chance
of error.







TABLE XIII.—AVERAGE DIFFERENCES BETWEEN ACTUAL SCORES AND SCORES
TRANSMUTED ACCORDING TO PROCEDURE DESCRIBED IN TEXT


                                         | Greatest difference | Smallest difference | Average difference | Average of three | Average of three ÷ standard deviation
Otis
Army Transmuted and Otis Actual          | 38                  | .04                 | 12.8               |                  |
Terman Transmuted and Otis Actual        | 45                  | [illegible]         | 15.6               | 13.9             | 2.45
Miller Transmuted and Otis Actual        | 39                  | .58                 | 13.5               |                  |
Terman
Miller Transmuted and Terman Actual      | 71¹                 | [illegible]         | 13                 |                  |
Army Transmuted and Terman Actual        | 49                  | .2                  | 17.2               | 15.4             | 2.59
Otis Transmuted and Terman Actual        | 41                  | [illegible]         | 16.2               |                  |
Army
Miller Transmuted and Army Actual        | 45                  | [illegible]         | 13.4               |                  |
Terman Transmuted and Army Actual        | 41                  | 2                   | 14.8               | 13.1             | 2.63
Otis Transmuted and Army Actual          | 31                  | 3                   | 11                 |                  |
Miller
Otis Transmuted and Miller Actual        | 23                  | .6                  | 8.3                |                  |
Terman Transmuted and Miller Actual      | 42                  | 8                   | 7.8                | 8.6              | 2.46
Army Transmuted and Miller Actual        | 31                  | .2                  | 9.7                |                  |


1 Undoubtedly an exceptional case.


one test we find almost immediately a difficulty. A casual glance at 
the four averages, however, reveals the fact that those tests which 
have the smallest range have also the smallest average difference. For 
example, Miller has a possible range of 0 to 120, while Alpha has a 
possible range of 0 to 212, and Otis from 0 to 230. A change of one 
point in Miller is almost equivalent to two in Otis. It seems reason- 




able therefore to divide the average difference by the standard 
deviation of the actual test into which the three other tests were 
transmuted. This method places Otis first, Miller second, Terman 
third, and Army Alpha fourth. This may be interpreted to mean that
Otis shows less variation from the expected score than any other test
and in this respect is superior.
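
The adjustment for range described above amounts to dividing each test's average difference by its own standard deviation. A minimal sketch, using the "average of three" figures from Table XIII and the standard deviations tabulated earlier (the dictionary and function names are illustrative):

```python
# "Average of three" from Table XIII and standard deviations from the text.
AVERAGE_OF_THREE = {"Otis": 13.9, "Terman": 15.4, "Alpha": 13.1, "Miller": 8.6}
STANDARD_DEVIATION = {"Otis": 5.68, "Terman": 5.94, "Alpha": 4.98, "Miller": 3.50}


def relative_discrepancy(test):
    """Average difference expressed in units of the test's standard deviation."""
    return AVERAGE_OF_THREE[test] / STANDARD_DEVIATION[test]


# Smallest relative discrepancy first.
ORDER = sorted(AVERAGE_OF_THREE, key=relative_discrepancy)
```

The resulting order is Otis, Miller, Terman, Alpha, as stated in the text.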

To summarize: Large differences are found between intelligence 
tests in placing pupils even in so gross divisions as corresponding 
thirds, there being from 28 to 47 per cent displaced according to the 
closeness of the relationship. Transmutation of scores by assuming 
perfect correlation in the regression equation finds also large differences 
between scores assumedly the same, the range being from 7.8 points 
to 17.2 on the average. The differences in individual scores may run 
as high as 45 or 50. The inference is, therefore, that even when we 
make due allowance for the variability in human nature there is still 
a large residuum due (1) to imperfections in the tests, or (2) to
development of the tests along somewhat different lines caused by the
lack of agreement in the definition of intelligence, or (3) to some other
undiscovered cause or causes.


(To be continued in October) 













AN EXPERIMENTAL STUDY OF THE RELATIVE 
DIFFICULTY OF TRUE-FALSE, MULTIPLE- 
CHOICE, AND INCOMPLETE-SENTENCE 
TYPES OF EXAMINATION QUESTIONS 


H. H. REMMERS, L. E. MARSCHAT, ADELAIDE BROWN and ISABELLA
CHAPMAN¹


Colorado College, Colorado Springs, Colorado


In a recently published article,² the present writer urged the desirability
of norms and standards of achievement in the mastery of the
factual material of textbooks. It seems to him desirable to carry over
to classroom instruction and examinations the more objective methods
of the builders of mental and educational tests. McCall in his recent
book³ points out the importance of examinations in the economy of
the work of the schools of this country. Knight⁴ and Barthelmess⁵
have also published articles bearing on the problem here under consideration.
Much of the literature on psychological and educational
research of recent years could be cited as applying more or less directly 
to our problem, but it is not the purpose of the present study to give a 
summary of this literature. 

The writer, while discussing these types of questions in a class in 
educational psychology was met by the objection that they were of 
unequal and undetermined difficulty, and that they were undesirable 
in that they tended to suggest to the examinee the required answer. 
The latter objection made particularly concerning the multiple-choice 
and true-false types of questions, has been met in the references cited 
and will not be treated here. It was determined, however, to secure 
some objective evidence on the relative difficulty of such questions. 







1 Responsibility for this study is divided as follows: Marschat, Brown, and
Chapman collected the data at the suggestion and under the supervision of Remmers.
The statistical treatment of the data was done by Marschat and Remmers.
For the interpretation and publication of the results the latter is wholly responsible.

2 Remmers, Hermann H.: A Suggestion to Writers and Users of Text-books.
School and Society, March 3, 1923, pp. 243-4.

3 McCall, Wm. A.: “How to Measure in Education,” pp. 119.

4 Knight, F. B.: Data on True-false Test as a Device for College Examination.
Jour. of Educ. Psych., February, 1922, pp. 75-80.

5 Barthelmess, H. M.: Reply to a Criticism of Tests Requiring Alternate
Responses. Journal of Educational Research, November, 1922, pp. 357-59.



SUBJECTS AND MATERIALS OF THE EXPERIMENT 


Fifty-six members of an elementary course in psychology were used 
as subjects. Each of these had been given a mental test (Otis Group 
Test Form A) and on the basis of the obtained scores, the students 
were divided into four comparable groups, 14 in each group. Among
those competent to judge, the most commonly accepted definition of
whatever it is that is measured by such tests of “general mental
ability” is perhaps ability to learn. It was therefore assumed for the
purposes of this experiment, which tested the ability to learn and 
retain certain specific bits of factual material, that these groups were 
approximately equal. It may be argued that this was an unwarranted 
assumption. It will be more profitable to consider this objection after 
the data of the experiment have been presented. 

In the course in psychology the students are required to attend 
one laboratory period of two hours each week. The first part of this 
laboratory course is given over to a rather intensive study of the 
development of the central nervous system. From charts, diagrams, 
microscopic slides, and models of the brain and spinal cord the students
were required to make and properly label a specified number of draw- 
ings. They were also required to hand in with each drawing, five true- 
false statements, five multiple-choice statements, and five incomplete 
sentences on the material studied. All labels on the drawings were 
carefully checked, and no drawing was finally accepted and graded 
until all labels were entirely correct. In this way it was assured that all 
students were exposed to all the material used in the experiment. It 
was from the examination questions that the students themselves 
handed in, that the test questions used in the experiment were selected. 
Sixty questions each, of all three types calling for the same specific 
bit of information were culled from the mass of material at hand. 
A sample of each type of question follows: 


True-false.—The anterior third of the embryonic spinal cord forms 
the brain. 


Multiple-choice.—The anterior (third, half, two-thirds) of the embryonic spinal
cord forms the brain.


Incomplete-sentence.—The anterior ________ of the embryonic spinal
cord forms the brain.













Relative Difficulty of Examinations 369 


METHOD AND DATA 


The 60 questions of each of the three types were mimeographed, 
and about six weeks after the portion of the laboratory course dealing 
with the development of the nervous system had been completed, the 
four groups of students used in the experiment were examined without 
having been given previous warning. The questions were distributed 
as follows: 


Groups 1 and 2 Form B (multiple-choice) 
Group 3 Form A (true-false) 
Group 4 Form C (incomplete-sentence) 


The students, already familiar with the various types of questions 
were instructed as follows: 


“What we are about to carry out is not an examination, but an
experiment. Your performance in this experiment will in no
way affect your grade in the course. We are merely attempting
to discover the relative difficulty of the different types of questions
that have been passed out. In order to have the experiment a
success, however, it will be necessary for you to do your best.”


A spirit of cordial cooperation was apparent among the partici- 
pants. Sufficient time was allowed for all to finish. Table I gives the 
mental rating (raw scores), the scores on the test questions, and the 
averages and variability in terms of SD for each group. 

How do we know whether the difference between any two of the 
average scores is a significant one? And if significant, by how much? 
Specifically, is there any warrant for thinking that the difference 
between the average scores of groups 1 and 2 is or is not significant? 
By using McCall's technique for calculating the “experimental coefficient”¹
we find that there is not; the chances are approximately 1 to 1
that this difference of 1.5 means nothing. The formula for this
coefficient follows:

$$\text{Experimental coefficient} = \frac{\text{Difference}}{2.87 \times \sigma_{\text{diff.}}}, \qquad \text{where } \sigma_{\text{diff.}} = \sqrt{\sigma_1^2 + \sigma_2^2}.$$





1 Op. cit., pp. 404. 
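
A minimal sketch of this computation follows. It assumes that the coefficient is the difference between two group averages divided by 2.87 times σ diff., and that the σ's entering σ diff. are the group standard deviations printed in Table I (6.5, 5.5 and 5.8 for groups 3, 1 and 4). That reading is an inference from the printed figures, not a statement of McCall's procedure; under it the coefficients for groups 1 and 3 and for groups 4 and 3 come out at about .80 and .25, close to those of Table II. The function name is illustrative.

```python
import math


def experimental_coefficient(mean_a, mean_b, sd_a, sd_b, constant=2.87):
    """Difference between two averages divided by `constant` times the
    standard deviation of the difference (assumed form of the formula)."""
    sd_diff = math.sqrt(sd_a ** 2 + sd_b ** 2)
    return abs(mean_a - mean_b) / (constant * sd_diff)


ec_1_and_3 = experimental_coefficient(33.57, 14.00, 5.5, 6.5)
ec_1_and_4 = experimental_coefficient(33.57, 20.30, 5.5, 5.8)
ec_4_and_3 = experimental_coefficient(20.30, 14.00, 5.8, 6.5)
```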




TABLE I.—SHOWING THE MENTAL RATING, SCORE ON TEST QUESTIONS, AVERAGES
AND STANDARD DEVIATION FOR EACH GROUP OF SUBJECTS


Form A, Group 3                 | Form B, Group 1                 | Form B, Group 2                 | Form C, Group 4
Subject  Mental rating  Score   | Subject  Mental rating  Score   | Subject  Mental rating  Score   | Subject  Mental rating  Score
Wi       204            26      | Ca       207            38      | [illegible] 212         41      | Wl       210            21
Go       200            18      | St       204            31      | Mi       197            30      | He       200            20
Ma       199             5      | Ea       199            33      | Ni       197            32      | We       198            28
Ny       198            16      | Da       198            47      | No       196            33      | Wn       194            26
St       190            20      | Br       190            33      | Pu       192            26      | Ki       193             9
Mo       186             8      | Wh       181            35      | Cl       184            27      | Ba       180            26
Il       186             3      | Fa       181            34      | Cr       181            39      | Ar       180            25
Ga       176            22      | Li       175            38      | Ro       176            28      | Sa       176            18
Ta       168            17      | Co       175            35      | El       172            35      | Pa       169            27
Mt       166            14      | Mc       168            31      | KO       169            28      | Co       168            20
Pe       162             6      | Gr       168            34      | Ho       168            36      | Re       168            16
No       161            12      | SW       165            22      | Re       160            32      | Wr       166            20
Ed       158            11      | Cl       155            26      | Ed       149            36      | Dr       157            19
Mo       147            16      | Cp       122            33      | Fa       135            26      | Be       126             9

Averages            177.79  14      |          177.78  33.57          |          177.78  32.07          |          177.78  20.3
Standard deviation  ......  6.5     |          ......  5.5            |          ......  [illegible]    |          ......  5.8
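
The averages and standard deviations of the experimental scores can be recomputed from the individual scores listed in Table I. The sketch below assumes that the standard deviation is taken about the group mean with N (not N − 1) in the divisor; that convention reproduces the legible printed values of 6.5, 5.5 and 5.8, and gives roughly 4.7 for Group 2, whose printed value is illegible. The dictionary keys are group numbers.

```python
from statistics import mean, pstdev

# Scores on the test questions, read column by column from Table I.
SCORES = {
    1: [38, 31, 33, 47, 33, 35, 34, 38, 35, 31, 34, 22, 26, 33],  # Form B
    2: [41, 30, 32, 33, 26, 27, 39, 28, 35, 28, 36, 32, 36, 26],  # Form B
    3: [26, 18, 5, 16, 20, 8, 3, 22, 17, 14, 6, 12, 11, 16],      # Form A
    4: [21, 20, 28, 26, 9, 26, 25, 18, 27, 20, 16, 20, 19, 9],    # Form C
}

# (average, standard deviation) for each group; pstdev divides by N.
SUMMARY = {group: (mean(s), pstdev(s)) for group, s in SCORES.items()}
```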





























Table II gives the data for all the differences. 


TABLE II.—SHOWING THE EXPERIMENTAL COEFFICIENT OF THE DIFFERENCES
AMONG THE FOUR GROUPS, AND THEIR RELATIVE SIGNIFICANCE IN TERMS
OF APPROXIMATE CHANCES


Groups                                       | Average scores  | Experimental coefficient | Approximate chances
1 and 2 (multiple-choice)                    | 33.57 and 32.07 | .07                      | 1 to 1
1 and 3 (multiple-choice and true-false)     | 33.57 and 14.00 | .8                       | 75 to 1
1 and 4 (multiple-choice and incomplete)     | 33.57 and 20.30 | .57                      | 17 to 1
2 and 3 (multiple-choice and true-false)     | 32.07 and 14.00 | .72                      | 41 to 1
2 and 4 (multiple-choice and incomplete)     | 32.07 and 20.30 | .54                      | 15 to 1
4 and 3 (incomplete and true-false)          | 20.30 and 14.00 | .25                      | 3 to 1














From the above it is apparent that, on the assumed equality of the
four groups, the probabilities are very high that there is a significant
difference between the difficulty of multiple-choice and true-false
statements; that the probability that the incomplete sentence is more







difficult than the multiple-choice type of question is substantial;¹
and that the difference between incomplete sentences and
true-false statements is not great enough to be regarded as significant.

It might be argued that it follows from the assumption of approxi- 
mate equality of the four groups as determined by Otis scores, that 
there should be a high positive correlation between these mental ratings 
and the experimental scores. This was not found to be the case. 


Another criterion that was available was the semester grades of the
students. Using the formula R = 1 − [illegible], the correlation of each
of these criteria (i.e., Otis scores and semester grades) with the experimental
scores was found as listed in Table III.


TABLE III.—SHOWING THE CORRELATION BETWEEN OTIS SCORES AND EXPERIMENTAL
SCORES AND BETWEEN SEMESTER GRADES AND EXPERIMENTAL SCORES


          | R between Otis scores and experimental scores | R between semester grades and experimental scores
Group 1   | +.262                                         | +.339
Group 2   | +.047                                         | +.370
Group 3   | +.154                                         | +.247
Group 4   | +.324                                         | −.123











In view of the small number of cases in each group it was not con- 
sidered worth the statistical labor to use the more accurate product 
moment method of correlation. One such r was calculated between 
Otis scores and experimental scores (Group 3, Form A): r = .179 ± .201.
The difference between R and r is too slight to warrant the work 
involved. Obviously the PE of each of these coefficients is so large as 
to make them highly unreliable. By using McCall’s Transmutation 


1 A theoretical consideration concerning the method of scoring should be taken 
up here. Just as in the true-false statements half of the answers would be correct 
by pure chance, so in the multiple choice statements one-third of the answers would 
be correctly answered if only chance operated, since only three answers were pos- 
sible to each question. If this had been taken into account in scoring the multiple 
choice statements, the average scores for groups 1 and 2 would have been 22.38 and 
21.38 respectively, and the difficulty of this type of statement would then be 
very nearly that of the incomplete sentence type. In the ordinary standardized 
mental and educational test this factor of chance insofar as it concerns multiple 
choice statements is usually disregarded. 
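
The corrected averages quoted in this note correspond to taking away one-third of each raw average, as the following sketch shows. The second function gives the commonly used rights-minus-wrongs form of the correction, which works from each examinee's right and wrong answers; it is included only for comparison and is not the computation used in the note. Both function names are illustrative.

```python
def deduct_chance_share(raw_average, n_options):
    """Remove from a raw average the fraction 1/n_options attributed to chance."""
    return raw_average * (1 - 1 / n_options)


def rights_minus_wrongs(right, wrong, n_options):
    """Common alternative correction: rights minus wrongs / (n_options - 1)."""
    return right - wrong / (n_options - 1)


group_1_corrected = deduct_chance_share(33.57, 3)
group_2_corrected = deduct_chance_share(32.07, 3)
```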
















Table¹ R may be transmuted to r, and PE calculated from the usual

formula: $PE_r = .6745\,\dfrac{1 - r^2}{\sqrt{N}}$.


Does it follow then that the assumption that these four groups 
were equal is invalidated? Not necessarily. There is a reasonable 
supposition, at least, that our assumption was substantially correct 
inasmuch as groups 1 and 2, both tested with Form B, were proved to 
be practically equal so far as this test is concerned. The two available 
criteria of equality (Otis scores and semester grades) have apparently 
about the same value. An extension of this experiment? is required to 
demonstrate conclusively that the obtained differences represent valid 
differences. The data of this experiment do, however, furnish a strong 
presumption in the support of such a conclusion. 





1 Op. cit., pp. 393. 


2 I shall be glad to receive criticisms of this study with a view to extending the 
investigation here reported. 




A UNIFORM OBJECTIVE EXAMINATION ON 
INTELLIGENCE TESTING 


DENTON L. GEYER 
Chicago Normal College 


A standardized examination in the field of intelligence testing would 
allow the instructor to compare his class with similar classes elsewhere; 
would enable him to experiment with different methods of teaching; 
and, as compared with the essay examination, would cover more 
ground, save time in marking papers, and furnish an objective basis 
for dealing with indolent or incompetent students. In brief, a stand- 
ardized examination would have the same advantages in college that 
it has elsewhere. But in colleges it would, of course, attempt no more 
than to cover certain agreed-upon minimum essentials, leaving each 
instructor free to supplement this section of the course in any way 
he chose. 

As an initial step toward such an examination, a true-false test of 
90 statements and a multiple-answer or recognition test of 60 
statements, based on the Yearbook on Intelligence Testing issued 
by the National Society for the Study of Education, were mailed to 
summer school classes in July, 1922, for preliminary try out and standardization.
The norms thus secured are shown in Table I.


TABLE I.—SCORES AND PERCENTILE RANKS FOR 540 STUDENTS IN 19 UNIVERSITIES,
COLLEGES, AND NORMAL SCHOOLS


Percentile rank            | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90
Score on multiple answer   | 30 | 33 | 36 | 38 | 40 | 42 | 44 | 47 | 50
Score on true-false        | 26 | 34 | 40 | 46 | 50 | 54 | 58 | 62 | 70

















Whether there is sufficient uniformity in the content of various 
courses on intelligence testing to make standardization feasible was 
investigated by computing the percentage of each class which missed 
each question. Assuming that answers can not be easily guessed, or 
supplied from general information, if the number missing a given 
question in various institutions is about the same, the distribution of 
emphasis among various topics is presumably similar. These results, 
presented for the multiple-answer test in Table II, show that when the 



TABLE II.—FREQUENCY OF DIFFERENCES AMONG COLLEGES IN PERCENTAGE OF
CLASS MISSING EACH QUESTION OF THE MULTIPLE-ANSWER TYPE


Differences | Chicago and Minnesota | Chicago and Chicago Normal | Chicago and Michigan | Minnesota and Chicago Normal | Minnesota and Michigan | Chicago Normal and Michigan | Totals | Totals in per cents
0-9         | 34                    | 37                         | 23                   | 27                           | 24                     | 27                          | 172    | 48
10-19       | 8                     | 10                         | 21                   | 15                           | 17                     | 14                          | 85     | 24
20-29       | 12                    | 3                          | 11                   | 7                            | 14                     | 15                          | 62     | 17
30-39       | 2                     | 7                          | 1                    | 4                            | 2                      | 3                           | 19     | 5
40-49       | 3                     | 4                          | 4                    | 5                            | 1                      | ..                          | 17     | 5
50-59       | ..                    | ..                         | ..                   | 2                            | 1                      | ..                          | 3      | ..
60-69       | 1                     | ..                         | ..                   | ..                           | ..                     | ..                          | 1      | 1
70-79       | ..                    | ..                         | ..                   | ..                           | 1                      | ..                          | 1      | ..





























percentage of a class missing each question in each of four representa- 
tive institutions is subtracted from the percentage missing the cor- 
responding question in another of these institutions, the remainders 
are very small. About one-half of the remainders are less than 
10, about three-fourths are less than 20, and about nine-tenths are 
less than 30. The agreement for the true-false test is slightly closer. 
Only one of these four institutions had used the “Yearbook” as a 
textbook. These results would seem to show that the courses now 
given on intelligence testing are much more alike than we might have 
supposed. 
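
The fractions quoted in this paragraph can be checked against the Totals column of Table II by accumulating the counts. A small sketch (the names are illustrative):

```python
# Totals column of Table II: number of question-by-pair comparisons
# falling in each band of difference (in percentage points).
TOTALS = {
    "0-9": 172, "10-19": 85, "20-29": 62, "30-39": 19,
    "40-49": 17, "50-59": 3, "60-69": 1, "70-79": 1,
}


def cumulative_shares(totals):
    """Share of all comparisons at or below each band, in table order."""
    grand_total = sum(totals.values())
    running, shares = 0, {}
    for band, count in totals.items():
        running += count
        shares[band] = running / grand_total
    return shares


SHARES = cumulative_shares(TOTALS)
```

The cumulative shares for differences under 10, under 20 and under 30 come to roughly .48, .71 and .89, in line with the "one-half," "three-fourths" and "nine-tenths" of the text.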

To get some light on the validity of these tests, their scores were 
correlated in the Chicago Normal College with average scholarship 
for the previous semester. The true-false type, in an earlier and pre- 
sumably less perfect edition, correlated at 0.31 (n = 56); and the 
multiple-answer type in the edition sent to summer schools at 0.45 
(n = 30). A more recent revision of the multiple-answer type (one 
that eliminates items shown by the summer returns and by requested 
criticisms to be too easy or too hard or debatable or of minor impor- 
tance) correlates with scholarship at 0.51 (n = 56). These correla- 
tions for the multiple-answer type are as close as those in this college 
between any two studies. Observation through several semesters 
has also shown that students who have failed in some other course are 
usually placed by this test in the lower quintile, and that occasional 


graduate students in a class of sophomores are placed in the upper 
10 per cent. 










A Uniform Objective Examination 375 


A true-false test of 90 items, though apparently satisfactory for 
rating a class as a whole, is probably too much affected by guessing 
as Dr. Hahn has recently pointed out¹ to be of much value in rating
individuals. If this test is similarly revised it will therefore be 
greatly lengthened. 

In the revised form of the multiple-answer test guessing has been 
minimized by supplying four or more ways of completing each state- 
ment, and no statement has been retained which does not seem to 
contribute some item of importance. The field covered is not now 
limited by the distribution of emphasis in the “Yearbook” but is in
intent determined by considerations as to what the minimum essentials 
are in the field of group intelligence testing. Giving the revised form 
to one class before it has taken the older test, and to another class 
after it has taken the older test, shows that the norms of last summer 
apply to the revision without change. 

Of course much better tests than these could be made by a com- 
mittee drawing material from the objective examinations now used by 
the numerous persons teaching this subject. Would it be worth 
while to have such a committee created? 





1 Hahn, H. H.: Criticism of Tests Requiring Alternate Responses. Journal
of Educational Research, Vol. VI, October, 1922, pp. 236.








COMMUNICATIONS AND DISCUSSIONS 





NOTE UPON HOLZINGER’S FORMULA FOR THE 
PROBABLE ERROR 


TRUMAN L. KELLEY 
Stanford University 


In Dr. Holzinger’s article, An Analysis of the Errors in Mental
Measurement, appearing in this journal, May, 1923, occurs a new
formula for the probable error of the mean:

$$\text{P.E.}_{M} = \frac{.6745\,\sigma_{1}\sqrt{2 - r_{12}}}{\sqrt{N}} \qquad (6)$$


One important point seems to have been overlooked in the derivation 
of this formula. In particular the use, in this connection, of the 


equation PEa. = ~/P.E..?+ P.E.,? (found near the foot of p. 284) 
is not warranted. 


Using the same notation as Holzinger, we have $X_1 = X' + \delta_1$,
in which $X_1$ is the obtained score, $X'$ the individual’s true ability
and $\delta_1$ his “response error.” The individual’s true ability, $X'$,
may be set equal to $M_1 + x'$, the mean for the group plus the true
deviation of the individual from the group mean. If we possessed true
scores for all the members of the group, $x'$ could correctly be called
the error of sampling, i.e., the taking of an individual true score as
evidence of the mean score would involve an error of this magnitude,
$x'$. It is in this sense that Holzinger, according to my
interpretation, has defined the “error of sampling.” However, he
considered the standard deviation of the error of sampling when thus
defined to be $\sigma_{x_1}$, but this is not the case when fallible
measures are used.


$$X_1 = M_1 + x' + \delta_1, \quad \text{or} \quad X_1 - M_1 = x' + \delta_1.$$

If we let $x_1 = X_1 - M_1$, this may be written

$$x_1 = x' + \delta_1,$$

and if $x'$ and $\delta_1$ are uncorrelated we have

$$\sigma_{x_1}^{2} = \sigma_{x'}^{2} + \sigma_{\delta_1}^{2}.$$

Thus $\sigma_{x_1}^{2}$ is not equal to the square of the standard error of
sampling, as the term has been defined by Holzinger, but is equal to the
sum of two things, $\sigma_{x'}^{2}$ and $\sigma_{\delta_1}^{2}$. Since
$\sigma_{x_1}^{2}$ has $\sigma_{\delta_1}^{2}$ involved in it this quantity
should not be added a second time as it has been done to obtain
equation (6).
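
The additivity on which this argument rests can be checked with a small numerical case. The Python sketch below uses four hypothetical pupils, with made-up true deviations and response errors chosen so that the two are exactly uncorrelated; none of the figures come from Holzinger's data, and the variable names are illustrative only.

```python
from statistics import pvariance

# Hypothetical deviations for four pupils (not data from the article).
true_dev = [-2.0, -2.0, 2.0, 2.0]        # x'      : true deviation from the mean
response_err = [-1.0, 1.0, -1.0, 1.0]    # delta_1 : response error
obtained_dev = [t + e for t, e in zip(true_dev, response_err)]  # x_1 = x' + delta_1

# The two components are exactly uncorrelated: their cross-products sum to zero.
cross_products = sum(t * e for t, e in zip(true_dev, response_err))

var_true = pvariance(true_dev)          # sigma_{x'}^2
var_err = pvariance(response_err)       # sigma_{delta_1}^2
var_obtained = pvariance(obtained_dev)  # sigma_{x_1}^2

print("sum of cross-products:", cross_products)
print("variance of true deviations:", var_true)
print("variance of response errors:", var_err)
print("variance of obtained deviations:", var_obtained)
print("obtained variance plus error variance again:", var_obtained + var_err)
```

In this case the variance of the obtained deviations (5) already contains the response-error variance (1), so adding the latter again counts it twice (6 = 4 + 1 + 1).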

“Sampling error” in terms of its derivation and as regularly treated
statistically means more than the error involved in taking a single true
score as indication of the mean score. We should use some such
expression as “the deviation of the individual in true ability from the
mean” and not “sampling error” to designate that part of the total
deviation from the mean which is not represented by the “response
error.”





REPLY TO PROFESSOR KELLEY’S CRITICISM 
KARL J. HOLZINGER 
University of Chicago 


I wish to thank Professor Kelley for his critical scrutiny of formula
(6). It seems to me, however, that if the formula fails it is due rather
to certain assumptions in the proof than to the erroneous addition of
errors as he suggests. To fix ideas, I shall use his notation and
equations as far as possible. Let

$$x_1 = x' + \delta$$

where $x_1 = X_1 - M_1$, $\delta = X_1 - X'$ and $x' = X' - M_1$ =
“deviation of the individual in true ability from the mean.” Now if we
write $x' = x_1 - \delta$ and assume $x_1$ and $\delta$ uncorrelated, it
is evident that

$$\sigma_{x'}^{2} = \sigma_{x_1}^{2} + \sigma_{\delta}^{2}.$$

But

$$\sigma_{\delta}^{2} = \sigma_{x_1}^{2}(1 - r_{12})$$

by equation (3) in my article. Therefore,

$$\sigma_{x'}^{2} = \sigma_{x_1}^{2}(2 - r_{12}) \qquad (1)$$

and formula (6) follows at once, since $\sigma_{x_1} = \sigma_1$. Thus
Professor Kelley’s reasoning may lead him to the formula to which he is
objecting.

Suppose, however, that we write $x_1 = x' + \delta$ and assume that $x'$
and $\delta$ are uncorrelated. It then follows that

$$\sigma_{x_1}^{2} = \sigma_{x'}^{2} + \sigma_{\delta}^{2}$$

or,

$$\sigma_{x'}^{2} = \sigma_{x_1}^{2} - \sigma_{\delta}^{2}.$$

But,

$$\sigma_{\delta}^{2} = \sigma_{x_1}^{2}(1 - r_{12})$$

as cited above. Therefore,

$$\sigma_{x'}^{2} = \sigma_{x_1}^{2} r_{12} \qquad (2)$$
















Kelley¹ has worked out a lengthy proof of formula (2) and expressed
it in the notation $\sigma_{\infty}^{2} = \sigma_{1}^{2} r$, but the
formula is wrongly ascribed to him
by Thorndike and others inasmuch as it is identical with formula (5)
on page 213 in Yule. It is, as Yule points out, merely a part of 
Spearman’s formula for the correction of the correlation-coefficient. 
That such careful workers as Thorndike and Kelley should have 
overlooked Spearman’s formula is very regrettable. 

Returning to the main argument, it appears that the two assumptions
cited above lead to widely different results, and if formula (1)
is at fault it is because of a wrong assumption, i.e., that $x_1$ and
$\delta$ are uncorrelated. As a choice between the two assumptions it
would seem more reasonable to expect $x'$ and $\delta$ to be
uncorrelated than $x_1$ and $\delta$, inasmuch as $\delta$ is a part of
$x_1$. Furthermore, I myself found some evidence of correlation between
$x_1$ and $\delta$ (page 283 of article under discussion). I therefore
suggest that formula (1) be held in abeyance until further evidence is
at hand. I do not recommend the use of formula (2) as a substitute
because, even though it is more reasonable than (1), assumptions as to
the correlation of errors are still not clearly justified.² Much more
careful inquiry is needed in the subject, and if this discussion in any
way prompts such study it may not be entirely amiss.
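
How widely the two assumptions differ can be seen by putting numbers into formulas (1) and (2). The Python sketch below is only an illustration: the standard deviation of 10 and the values of r12 are hypothetical, and the function names are not from the article.

```python
import math


def sd_true_formula_1(sd_obtained: float, r12: float) -> float:
    """Formula (1): x_1 and delta assumed uncorrelated.

    sigma_{x'}^2 = sigma_{x_1}^2 * (2 - r12)
    """
    return sd_obtained * math.sqrt(2.0 - r12)


def sd_true_formula_2(sd_obtained: float, r12: float) -> float:
    """Formula (2): x' and delta assumed uncorrelated.

    sigma_{x'}^2 = sigma_{x_1}^2 * r12
    """
    return sd_obtained * math.sqrt(r12)


sd_obtained = 10.0  # hypothetical standard deviation of obtained scores

for r12 in (0.5, 0.8, 1.0):  # hypothetical correlations between two forms
    sd1 = sd_true_formula_1(sd_obtained, r12)
    sd2 = sd_true_formula_2(sd_obtained, r12)
    print(f"r12 = {r12:.1f}: formula (1) gives {sd1:.2f}, formula (2) gives {sd2:.2f}")
```

With r12 below 1, formula (1) puts the standard deviation of the true deviations above that of the obtained scores and formula (2) puts it below; the two agree only when r12 = 1.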





Editor The Journal of Educational Psychology, 
Sir: 

I was amazed at the sort of review my “Behaviorism and Psychology” received
at the hands of Mr. A. I. G. whom, did the initials not give me some clue as to
his identity, I might have taken to be one of those elementary students who
“will of course find such a book unintelligible.” Mr. A. I. G. implies that my work
is not thorough because it is “based mainly on a series of controversial articles and
a few books” (viz., over 200). Does the reviewer think I might have made use of
books and articles that have not yet been written in order to do justice to the behav-
iorists? Ah, yes, I have been stupid enough to discuss books and articles instead
of “fundamental trends,” and what is more I said nothing about Judd’s writings on
motor attitudes. True enough, nor have I mentioned the doctrine of transub-
stantiation or the Ptolemaic system of astronomy. Goodness knows why I must
consider in a critique of behaviorism all the psychologists that Mr. A. I. G. can
think of. And why, pray, must I dwell at length on “such an extensive theory as
those (that?) of Washburn in Movement and Mental Imagery” when I have referred


1 Kelley, Truman: The Reliability of Test Scores. Journal of Educational 
Research, May, 1921, pp. 379. 


2 Brown and Thomson: “Essentials of Mental Measurement.” pp. 160.













to passages in this very book as well as to several articles, in addition, by the same 
author to show that she is decidedly hostile to behaviorism? 

The reviewer’s statements and claims are astonishingly flimsy. What does he 
mean by saying “In treating the ‘Varieties of Behaviorism’ and other topics, we
are given (he gives us?) a series of quotations some of which may have been
unwittingly (sic) made . . .” when I have done my best to select the most
characteristic passages in every case; and what else could I have done than to 
content myself with very brief expositions of the various behavioristic writers, if 
the book was not to become so bulky as to prevent its publication? In order to 
represent the movement of behaviorism fairly, I cited and characterized the views 
of about 50 behavioristic writers of various shades and tints. Had I devoted only 
two pages on the average to each writer, the chapter on varieties of behaviorism 
alone would have exceeded 100 pages. 

Perhaps had the reviewer been less averse to logical treatment and systematic 
development of a subject, he would not have taken exception to my examining the 
logical setting of behaviorism nor would he have regarded as irrelevant the
appendix “How is Psychology to Be Defined?” He certainly would not have seen in
the book any discussion of the “incompatibility of behaviorism and philosophy,”
nor would he have demanded “a comprehensive view of movements within the
science itself” (what science?). Mr. A. I. G. does not seem to realize that my
task was to examine the behavioristic movement and its relation to psychology, 
and not to delve into the foundations of psychology. 

To conclude, a brief review does not necessarily have to be a perfunctory and 
carelessly written notice. A serious approach on the part of the reviewer is essential 
in every case whether the review be sympathetic or adverse. 

A. A. Roback,

Harvard University May 8, 1923. 














NEW PUBLICATIONS IN EDUCATIONAL
PSYCHOLOGY AND RELATED FIELDS OF
EDUCATION











CONDUCTED BY LAURA ZIRBES


Economy in Secondary Education.—Certain experimental activities 
carried on during recent years in the laboratory high school of the 
University of Chicago have been described by members of the faculty 
in a set of reports of unusual interest and suggestiveness. These
reports¹ deal with four major lines of experimentation seeking
respectively: (1) “Better and more economical material,” (2) “Better and
more effective technique of teaching,” (3) “individual personnel
study,” and (4) “more effective institutional organization.”

Outlines of a two-year sequence in history, now constituting 
practically the whole offering in that subject in the school, are pre- 
sented and discussed by H. C. Hill and A. F. Barnard. The first of 
these courses, elective for sophomores, is a Survey of Civilization which 
stresses community life in different periods from primitive man down, 
but includes relevant narrative as well. It is a venturesome
departure from the traditional in high school, but, when the strict
limitation of time for history is considered, it appears to be
remarkably well designed to give the perspective and the background
needed for an intelligent study of modern events and conditions. The
second course is one in Modern History covering “the chief movements
and most significant features in the history of the United States and
Europe since the middle of the eighteenth century.” Both courses are
divided
into seven or eight units of work, a mastery of the minimum essentials 
of each being required before credit is given. Reference books found 
to be particularly helpful in the course are referred to in the discussions. 

The effort to improve the technique of teaching in the school has 
resulted in the adoption of a procedure calculated to insure pupil 
mastery of units of work, to take account of individual differences, 


1 “Studies in Secondary Education. I.” University High School, University
of Chicago. Supplementary Educational Monographs, No. 24. Chicago Depart- 
ment of Education, University of Chicago, 1923. 




and to develop habits of effective independent study. Credit for this 
procedure is given several times in the monograph to Professor Henry C. 
Morrison, Director of Laboratory Schools. Its application in history
is clearly described in the report of Hill and Barnard; in Elementary
Science by W. L. Beauchamp; and in English Literature by Ernest F.
Hanes and Martha J. McCoy. This procedure, Herbartian in descent,
calls for five explicit steps in instruction: exploration, a preliminary
probing for existing relevant knowledge; presentation, a prospectus
from the teacher of the unit of work to be covered; assimilation, the
real work of study; written organization or recitation; and oral
recitation. Of these five, assimilation is regarded as much the most
important and occupies from one-half to three-fourths of the time
devoted to a unit. During assimilation the class may work for weeks
without a recitation as commonly understood. In fact, throughout the
procedure there is very little to remind one of the question and answer
recitation to which most of us are still addicted. Emphasis is laid
upon each pupil’s mastering all minimum essentials for himself. The
testing of his accomplishment is searching, and failure to show mastery
means re-study and re-testing. We have here no soft or sugar-coated
pedagogy.

In a report of a controlled experiment to determine the effect of 
teaching certain methods of study in Elementary Science, Beau- 
champ presents data tending to show the very real effect of specific 
attention to this sort of training. Pupils so trained do develop in a 
short time better comprehension, better organization, and greater 
power to apply what they read. The final report in the book is by 
Hanes and McCoy, on the teaching of a unit in English Literature 
(the essay) to classes of boys. It gives one the hope that high school 
instruction in English Literature may some day elude the cynics and 
jokesters by actually producing pupil enjoyment and the voluntary 
reading of literary material. Informal and inductive are the adjectives 
these instructors justly apply to the method which they describe. 

Individual personnel study is regarded by W. C. Reavis, Principal, 
and Elsie M. Smithies, Assistant Principal, as one of their most 
important functions. They present typical case studies and some 
striking data from the results of this kind of work. The completeness 
of their investigations of problem pupils is beyond the reach of most 
public high school principals. One feels, however, after reading this 
matter on constructive student-accounting that we could very well 
be doing much more than we ordinarily do to save high school pupils 
from scholastic disaster. 
















In the introductory chapter of the monograph Professor Morrison 
tells what has been done by way of institutional reorganization to 
effect an economy in time. The elimination of the eighth grade some 
years ago has been followed by the grouping of the seventh grade 
with the high school proper, making a five-year high school. In 
addition to this, by such economies in subject material as Hill and 
Breslich report, it has been made possible for capable students to 
secure more or less junior college credit by the time of graduation. It 
seems possible that the whole period of high school and elementary 
education in these laboratory schools may from now on be shortened 
to 10 years or the equivalent. The possibility of such a saving becomes 
every year more significant to those who are struggling to obtain 
support for public education. 

This whole monograph contains such substantial material, is so 
restrained in its conclusions, and is so provocative of thought that 
anything like a comprehensive review of it is impossible. It deserves 
careful reading by everyone engaged in secondary education. After
reading it one should feel not that conditions in this experimental
school with its relative abundance of equipment, excellence of teaching
force, and selected student body remove its procedures from possible
duplication in public, financially restricted high schools, but rather that
the results in such an institution may in time come to be appreciated 
by patrons of our public schools enough to induce them to provide 
what is needed for the similar education of the children of their 
own communities. 


Springfield, Illinois. M. H. Willing.





2. Industrial Needs and Present Educational Limitations.¹—Again
Dr. Link has produced a book worth reading. Like his “Employment
Psychology,” his “Education and Industry” not only represents the
most significant factors in the field so far developed, but also suggests
new avenues for progress. Fortunately in Dr. Link we have that rare 
combination of a professionally-trained teacher and industrially- 
trained executive—that combination so essential to the solution of 
the problems of education in industry. 

Primarily the book aims to propose a plan for the realization of the 
much needed good will between the employee and the employer; 


1 Link, Henry C.: “Education and Industry.” The Macmillan Company,
New York, 1923, pp. XV + 265.










and, secondly, to lead the way in showing how to train workers in
the various endeavors which are so vital to industry. To accomplish
the first end the author thinks general education of the worker defined 
in the broadest sense possible is essential; while vestibule schools, 
trade schools of the most modern nature for the executives, foremen, 
department heads, salesmen, etc., are necessary for the realization of 
the second ideal. The community as a whole as well as the individual 
industries must cooperate to accomplish this double purpose. And 
this is only fair since the community as well as the industry benefits 
from an improved education. 

The thoroughness of the book is such that the reader is tempted to 
wonder at the omission of certain matter of interest which of course 
could hardly have been unknown to Dr. Link—such as the important 
source of education offered by the workers themselves—the Labor 
University, in the city of Boston, the Rand school, and Labor Temple, in 
New York City, educational endeavors of the American Federation of 
Labor, and the Amalgamated Clothing Workers of America, etc., etc. 
While it is true that, of the educational facilities supplied by the
organizations of laborers, a great deal is propaganda, much is not.
Just what Dr. Link has done to separate the good from the bad in the
educational methods fostered by the employer, he might have done for
the educational facilities supplied by labor, and we regret his not
having done this, because he would certainly have done it well.

The book as a whole, I am sure, cannot be ignored by any student
of industrial education, and the specialist in the general field of
education will find Dr. Link’s suggestions greatly stimulating.


Northwestern University. A. J. Snow.





3. A Summary of Educational Psychology.¹—Because of the
tremendous strides which have been made in the last few decades in the
science of Educational Psychology, the writer of a textbook in the 
subject is faced with a serious dilemma. If he treats each topic with 
the detail it deserves, his book is in danger of assuming the proportions 
of an encyclopedia; if he attempts to summarize, he is in danger of 
writing a treatise which is on the one hand too technical for the 





1 Mead, A. R.: “Learning and Teaching.” Lippincott Publishing Co., 1923,
pp. 277.




beginner and on the other, too sketchy for the advanced student.
The writer of “Learning and Teaching” has fallen into the latter error.
The book attempts to cover every topic of importance which would
be included in his title; the result is a serious curtailment of detail
on important subjects. For example, in the summary of a section
of a chapter on page 16, he indicates that he has covered the topic,
“the relation between drill and habit formation.” This important
topic is treated in but six lines on page 15. The entire subject of
“Original Nature and Education” is handled in eight pages of text.

The topic of fatigue is left out of the text entirely, but a series of
difficult questions on the subject is included with references to
Thorndike, Starch and others. Possibly it would have been wiser to have
said nothing about the topic if it could not be treated fairly adequately
in the text.

The questions, exercises and experiments at the end of each chapter
seem to the writer to be excellent material for psychology classes.
This material is unusually voluminous, and in assigning subjects for
study, and class and group discussions the psychology teacher will
find it of great help.


Edwin H. Reeder.


Teachers College, 
N. Y. C. 










